
DataWorks: Simple tutorial for house buying group analysis

Last Updated: May 20, 2025

This topic guides you through house buying group analysis to help you quickly understand the data development and data analysis processes in DataWorks.

Introduction

This tutorial analyzes house buying situations of different groups based on house buying data. In this tutorial, on-premises data is uploaded to a MaxCompute table named bank_data by using DataWorks, and MaxCompute SQL nodes are used to analyze user groups to obtain a table named result_table. Finally, simple visualization and analysis are performed based on result_table to obtain user group profile data.

Note

This tutorial demonstrates DataWorks features based on simulated data. In actual business scenarios, you must make adjustments based on your business data.

The following figure shows the data forwarding path and data development process in this tutorial.

image

After data analysis is complete, you can obtain the following group analysis profile from the house buying data: The education levels of single people who have mortgages are mainly university.degree and high.school.

image

Preparations

Activate DataWorks

This tutorial describes how to get started with DataWorks in the China (Shanghai) region. You must log on to the DataWorks console and switch to the China (Shanghai) region to check whether DataWorks is activated in this region.

Note

In this tutorial, DataWorks is activated in the China (Shanghai) region. In an actual business scenario, you can determine the region in which you want to activate DataWorks based on the location of your business data.

  • If your business data is stored in an Alibaba Cloud service other than DataWorks, you must select the region in which the Alibaba Cloud service is activated.

  • If your business data is stored on an on-premises machine, you must access the business data over the Internet. In this case, you must select a region that is close to your geographical location to reduce access latency.

Activate DataWorks for the first time

If you are a new user, the following information will be displayed after you log on to the DataWorks console. The information indicates that DataWorks is not activated in the current region. You must click Purchase Product Portfolio for Free to activate DataWorks.

image

  1. On the buy page, configure the following parameters.

    • Region: Select the region in which you want to activate DataWorks. In this tutorial, China (Shanghai) is selected.

    • DataWorks Edition: Select the DataWorks edition that you want to purchase. In this tutorial, Basic Edition is selected.

      Note

      You can experience the features that are used in this tutorial in any DataWorks edition. For information about how to select an appropriate DataWorks edition based on your business requirements, see the Feature comparison section in the "Differences among DataWorks editions" topic.

  2. Click Confirm Order and Pay to complete the subsequent payment.

Activated but expired

If you have previously activated DataWorks in the China (Shanghai) region but the DataWorks edition has expired, the following prompt will appear. You must click Purchase Edition to purchase a DataWorks edition again.

image

  1. On the purchase page, configure the following parameters.

    • Edition: Select the DataWorks edition that you want to purchase. In this tutorial, Basic Edition is selected.

      Note

      You can experience the features that are used in this tutorial in any DataWorks edition. For information about how to select an appropriate DataWorks edition based on your business requirements, see the Feature comparison section in the "Differences among DataWorks editions" topic.

    • Region: Select the region in which you want to activate DataWorks. In this tutorial, China (Shanghai) is selected.

  2. Click Buy Now to complete the subsequent payment.

Important

If you cannot find the DataWorks edition that you purchased, you can perform the following operations:

  • Wait a moment and then refresh the page. Data updates may be delayed.

  • Check whether the current region is the same as the region in which you purchased the DataWorks edition. If the regions are different, the purchased edition is not displayed.

Activated and not expired

If you have already activated DataWorks in the China (Shanghai) region, you will enter the DataWorks overview page after you log on to the DataWorks console. You can directly proceed to the next step.

Create a workspace

  1. Go to the Workspaces page, switch to the China (Shanghai) region, and then click Create Workspace.

  2. On the Create Workspace page, configure the Workspace Name parameter based on your business requirements, turn on Participate in Public Preview of Data Studio, and then click Create Workspace.

    Note

    After February 18, 2025, if you use an Alibaba Cloud account to activate DataWorks in the China (Shanghai) region for the first time and create a workspace, the Participate in Public Preview of Data Studio parameter is not displayed and new-version Data Studio is activated by default.

Create a resource group and associate it with the workspace

  1. Go to the Resource Groups page, switch to the China (Shanghai) region, and then click Create Resource Group.

  2. On the DataWorks Serverless Resource (PAYG) page, configure the following parameters.

    • Resource Group Name: Specify a name based on your business requirements.

    • VPC and vSwitch: Select an existing virtual private cloud (VPC) and an existing vSwitch. If no VPC or vSwitch exists in the current region, click the link in the parameter description to go to the VPC console and create one.

    • Service-linked Role: Follow the instructions to create a service-linked role. For more information, see DataWorks service-linked role.

  3. Click Buy Now. Then, complete the payment.

  4. Go to the Resource Groups page, switch to the China (Shanghai) region, find the created resource group, and then click Associate Workspace in the Actions column.

  5. In the Associate Workspace panel, find the created DataWorks workspace and click Associate in the Actions column.

Create a MaxCompute project and associate the project with the workspace as a computing resource

This tutorial requires you to create a MaxCompute project and associate it with the workspace as a computing resource for receiving data and performing big data analysis in subsequent operations.

  1. Go to the Workspaces page in the DataWorks console, switch to the China (Shanghai) region, find the created workspace, and then click the workspace name to go to the Workspace Details page.

  2. In the left navigation pane, click Computing Resource. On the Computing Resource page, click Associate Computing Resource. In the Associate Computing Resource panel, click MaxCompute. In the Associate MaxCompute Computing Resource panel, configure the following key parameters to create a MaxCompute project and associate it with the workspace as a computing resource.

    Note

    You can retain the default values for the parameters that are not described in the following list.

    • MaxCompute Project: Click the MaxCompute Project drop-down list and then click Create. In the Create Project dialog box, configure the following parameters:

      • Project Name: Specify a name based on your business requirements. Make sure that the specified name is globally unique.

      • Billing Method of Computing Resources: Select Pay-as-you-go.

        Note

        If pay-as-you-go is unavailable, click Activate next to Pay-as-you-go to activate the MaxCompute service.

      • Default Quota: Select an existing quota from the drop-down list.

    • Default Access Identity: Select Alibaba Cloud Account.

    • Computing Resource Instance Name: Specify a name for the computing resource. When you run tasks in subsequent operations, you can select a computing resource instance name to specify the computing resource that you want to use. In this tutorial, the computing resource is named MaxCompute_Source.

  3. Click OK.

Procedure

In this tutorial, you use DataWorks to upload the provided test data to a MaxCompute project and create a workflow in Data Studio to cleanse and write the data. You then debug and run the workflow and execute SQL statements to query and verify the running result.

Step 1: Create a MaxCompute table

Before you upload the test data, you need to use the data catalog feature of DataWorks to create a table named bank_data in the MaxCompute project to store the test data to be uploaded.

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane of the DataWorks console, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the Select Workspace drop-down list and click Go to Data Studio.

  2. In the left-side navigation pane of the Data Studio page, click the image icon to go to the Data Catalog pane.

  3. (Optional) If your MaxCompute project is not added to DataWorks as a data catalog, click the image icon next to MaxCompute to go to the DataWorks Data Sources tab. On the DataWorks Data Sources tab, add the MaxCompute project that is already added as a computing resource or data source to the workspace as a MaxCompute data catalog.

  4. In the DATA CATALOG pane, click MaxCompute. In the MaxCompute section, create a MaxCompute table in the Table folder under the MaxCompute project that is added as a data catalog.

    Note
    • If the schema feature is enabled for the MaxCompute project, you must open the schema in the MaxCompute project before you can create a MaxCompute table in the Table folder under the MaxCompute project.

    • In this tutorial, a workspace in standard mode is used, and debugging needs to be performed only in the development environment. Therefore, you need to create the bank_data table in the MaxCompute project only in the development environment. If you use a workspace in basic mode, you need to create the bank_data table in the MaxCompute project only in the production environment.

  5. Click the image icon next to Table to go to the table creation tab.

    In the DDL section of the table creation tab, enter the following SQL code. Then, the system automatically generates all information for the table.

    CREATE TABLE IF NOT EXISTS bank_data
    (
     age             BIGINT COMMENT 'The age',
     job             STRING COMMENT 'The job type',
     marital         STRING COMMENT 'The marital status',
     education       STRING COMMENT 'The education level',
     default         STRING COMMENT 'Whether a credit card is possessed',
     housing         STRING COMMENT 'The mortgage',
     loan            STRING COMMENT 'The loan',
     contact         STRING COMMENT 'The contact information',
     month           STRING COMMENT 'The month',
     day_of_week     STRING COMMENT 'The day of the week',
     duration        STRING COMMENT 'The duration',
     campaign        BIGINT COMMENT 'The number of contacts during the campaign',
     pdays           DOUBLE COMMENT 'The time elapsed since the last contact',
     previous        DOUBLE COMMENT 'The number of contacts with the customer',
     poutcome        STRING COMMENT 'The result of the previous marketing campaign',
     emp_var_rate    DOUBLE COMMENT 'The employment change rate',
     cons_price_idx  DOUBLE COMMENT 'The consumer price index',
     cons_conf_idx   DOUBLE COMMENT 'The consumer confidence index',
     euribor3m       DOUBLE COMMENT 'The euro deposit rate',
     nr_employed     DOUBLE COMMENT 'The number of employees',
     y               BIGINT COMMENT 'Whether a time deposit is possessed'
    );
  6. In the top toolbar of the configuration tab, click Deploy to create the bank_data table in the MaxCompute project in the development environment.

  7. After the bank_data table is created, click the table name in the MaxCompute section to view the detailed information of the table.
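
    Alternatively, you can inspect the table from SQL. The following statement is a minimal check that you can run in an SQL query file, assuming the table was deployed to the development-environment MaxCompute project:

      DESC bank_data;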

Step 2: Upload data to the bank_data table

Download the banking.csv file to your on-premises machine and then upload the data in the file to the bank_data table created in the MaxCompute project.

Important

Before you upload data in the file, make sure that you specify a resource group for scheduling and a resource group for Data Integration for data upload. For more information, see Limits.
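
In addition to the console-based upload described in the following steps, you can upload the file by using the MaxCompute client (odpscmd) if you have it installed. The following Tunnel command is a sketch that assumes the CSV file has a header row and comma-separated fields; replace your_project with the name of your development-environment MaxCompute project:

    tunnel upload banking.csv your_project.bank_data -fd "," -h true;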

  1. In the upper-left corner of the DataWorks console, click the image icon and choose All Products > Data Integration > Upload and Download.

  2. In the Recent Upload Records section of the Upload and Download page, click Upload Data. On the Upload Data page, configure the following parameters.

    • Data Source: Select Local File.

    • Specify Data to Be Uploaded:

      • Select File: Upload the banking.csv file that you downloaded to your on-premises machine.

    • Configure Destination Table:

      • Compute Engine: Select MaxCompute.

      • MaxCompute Project Name: Select the MaxCompute project to which the bank_data table belongs.

      • Select Destination Table: Select bank_data.

      • Preview Data of Uploaded File: Click Mapping by Order to map the columns in the file to the fields in the bank_data table.

    Note
    • You can upload files whose names are suffixed with .csv, .xls, .xlsx, and .json from your on-premises machine.

    • If you upload a file whose name is suffixed with .xls or .xlsx, the first sheet of the file is uploaded by default.

    • If you upload a file whose name is suffixed with .csv, the size of the file cannot exceed 5 GB. For other types of files, the file size cannot exceed 100 MB.

  3. Click Upload Data to upload data in the CSV file to the bank_data table.

  4. Check whether the data is written to the bank_data table.

    After the data is uploaded, you can use the SQL query feature to check whether the data is written to the bank_data table.

    1. Click the image icon in the upper-left corner of the DataWorks console and choose All Products > Data Analysis > SQL Query.

    2. In the SQL Query pane, click the image icon next to My Files and select Create File. In the Create File dialog box, configure the File Name parameter based on your business requirements and click OK.

    3. On the configuration tab of the file, enter the following SQL statement:

      SELECT * FROM bank_data LIMIT 10;
    4. In the upper-right corner of the configuration tab of the SQL query file, select the workspace to which the bank_data table belongs and the desired MaxCompute data source, and click OK.

      Note

      In this tutorial, a workspace in standard mode is used, and the bank_data table is created only in the development environment. Therefore, you must select a MaxCompute data source in the development environment. If you use a workspace in basic mode, you can select a MaxCompute data source in the production environment.

    5. In the top toolbar of the configuration tab, click the Run icon. In the Estimate Costs dialog box, click Run. After the SQL statement is executed, you can view the first ten data records in the bank_data table that are displayed in the lower part of the configuration tab. This indicates that the data in the file is successfully uploaded from your on-premises machine to the bank_data table.

      image
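
      Optionally, you can run an aggregate query as an additional sanity check that all records were loaded. This is a minimal sketch; the expected count depends on your copy of the banking.csv file:

        SELECT COUNT(*) AS row_count FROM bank_data;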

Step 3: Process data

In this step, you need to use MaxCompute SQL nodes to filter data that is uploaded to the bank_data table to obtain the data about the distribution of education levels among single people who have mortgages, and write the processed data to the result_table table.

Build a data processing link

  1. In the upper-left corner of the DataWorks console, click the image icon and choose All Products > Data Development and Task Operation > Data Studio.

  2. In the top navigation bar of the Data Studio page, switch to the workspace created for this tutorial. In the left-side navigation pane of the Data Studio page, click the image icon.

  3. In the Workspace Directories section of the DATA STUDIO pane, click the image icon and select Create Workflow. In the Create Workflow dialog box, enter a name in the Name field and click OK to save the workflow. In this tutorial, the Name parameter is set to dw_basic_case.

  4. On the configuration tab of the workflow, drag Zero Load Node and MaxCompute SQL from the left-side section to the canvas on the right, and specify names for the nodes.

    The following list describes the node names that are used in this tutorial and the functionalities of the nodes.

    • Zero load node (workshop_start): This node is used to manage all nodes involved in this tutorial and helps clarify the data forwarding path. You do not need to write code for this node.

    • MaxCompute SQL (ddl_result_table): This node is used to create a table named result_table. This table is used to store the cleansed data from the bank_data table.

    • MaxCompute SQL (insert_result_table): This node is used to filter data in the bank_data table and write the filtered data to the result_table table.

  5. Draw lines to configure dependencies between the nodes, as shown in the following figure.

    image

    Note

    You can draw lines to configure scheduling dependencies for nodes in a workflow. You can also use the automatic parsing feature to enable the system to automatically identify scheduling dependencies between nodes. In this tutorial, scheduling dependencies between nodes are configured by drawing lines. For information about the automatic parsing feature, see Use the automatic parsing feature.

  6. In the top toolbar of the configuration tab, click Save.

Configure the data processing nodes

Configure the ddl_result_table node

This node is used to create the result_table table. The table stores the data about the distribution of education levels among single people who have mortgages. The insert_result_table node writes data to this table.

  1. In the canvas of the workflow, move the pointer over the ddl_result_table node and click Open Node.

  2. Copy the following code and paste it in the code editor:

    CREATE TABLE IF NOT EXISTS result_table
    (  
     education   STRING COMMENT 'The education level',
     num         BIGINT COMMENT 'The number of persons'
    );
  3. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the MaxCompute SQL node, click Debugging Configurations. On the Debugging Configurations tab, make the following configurations:

    1. From the Computing Resource drop-down list, select the MaxCompute computing resource that you associated with the workspace during the preparations. Then, select the desired quota from the Computing Quota drop-down list.

    2. From the Resource Group drop-down list, select the serverless resource group that you purchased during the preparations.

  4. In the top toolbar of the configuration tab, click Save.

Configure the insert_result_table node

This node is used to process and filter data in the bank_data table to obtain the data about the distribution of education levels among single people who have mortgages and write the data to the result_table table for subsequent data analysis and chart display.

  1. In the canvas of the workflow, move the pointer over the insert_result_table node and click Open Node.

  2. Copy the following code and paste it in the code editor:

    INSERT OVERWRITE TABLE result_table  -- Insert data into result_table.
    SELECT education
        , COUNT(marital) AS num
    FROM bank_data
    WHERE housing = 'yes'
        AND marital = 'single'
    GROUP BY education;
  3. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the MaxCompute SQL node, click Debugging Configurations. On the Debugging Configurations tab, make the following configurations:

    1. From the Computing Resource drop-down list, select the MaxCompute computing resource that you associated with the workspace during the preparations. Then, select the desired quota from the Computing Quota drop-down list.

    2. From the Resource Group drop-down list, select the serverless resource group that you purchased during the preparations.

  4. In the top toolbar of the configuration tab, click Save.
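
    Optionally, before you run the workflow, you can sanity-check the aggregation logic by executing the SELECT part of the statement on its own in an SQL query file. This is the same filter and grouping as the node, without the write:

      SELECT education
          , COUNT(marital) AS num
      FROM bank_data
      WHERE housing = 'yes'
          AND marital = 'single'
      GROUP BY education;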

Step 4: Debug and run the workflow

After the dw_basic_case workflow is configured, you can debug and run the workflow on the Data Studio page to check whether the configurations of the workflow are correct. To debug and run the workflow, you must go to the configuration tab of the workflow again.

  1. After you complete the development of the inner nodes of the workflow, switch to the configuration tab of the workflow and click the Save icon in the top toolbar of the configuration tab to save the workflow.

  2. After the workflow is saved, click the Run icon in the top toolbar of the configuration tab to debug and run the workflow.

  3. After the running of the workflow is complete, view the result. The following figure shows the expected result.

    image

Step 5: Query and display data

After the data is uploaded to the MaxCompute computing resource and processed in Data Studio, you can query the data and view the data analysis result.

  1. Click the image icon in the upper-left corner of the DataWorks console and choose All Products > Data Analysis > SQL Query.

  2. In the SQL Query pane, click the image icon next to My Files and select Create File. In the Create File dialog box, configure the File Name parameter based on your business requirements and click OK.

  3. On the configuration tab of the file, enter the following SQL statement:

    SELECT * FROM result_table;
  4. In the upper-right corner of the configuration tab of the SQL query file, select the workspace to which the result_table table belongs and the desired MaxCompute data source, and click OK.

    Note

    In this tutorial, a workspace in standard mode is used, and the result_table table is created only in the development environment and is not deployed to the production environment. Therefore, you must select a MaxCompute data source in the development environment. If you use a workspace in basic mode, you can select a MaxCompute data source in the production environment.

  5. In the top toolbar of the configuration tab, click the Run icon. In the Estimate Costs dialog box, click Run.

  6. On the tab that displays the execution result, click the image icon to view the visualization result. You can click the image icon in the upper-right corner of the tab to customize the graph pattern.

  7. In the upper-right corner of the tab that displays the execution result, click the Save icon to save the chart as a card. In the left-side navigation pane of the SQL Query page, click the image (Cards) icon to view the saved card.

    image
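
    If you prefer to read the top education levels directly from the query output instead of a chart, you can run a sorted variant of the query. Note that MaxCompute SQL requires ORDER BY to be used together with LIMIT:

      SELECT education, num FROM result_table ORDER BY num DESC LIMIT 10;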

Additional information

Appendix: Resource release and deletion

If you want to release resources or delete nodes created for this tutorial, perform the following operations:

  1. Undeploy auto triggered nodes.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

    2. In the left-side navigation pane of the Operation Center page, choose Auto Triggered Node O&M > Auto Triggered Nodes. On the page that appears, select all the created nodes, click Actions in the lower part of the page, and then select Undeploy. The root node of the workspace does not need to be undeployed.

  2. Delete the workflow and disassociate the MaxCompute computing resource from the workspace.

    1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

    2. In the left navigation pane of the Data Studio page, click the image icon. In the Workspace Directories section of the DATA STUDIO pane, find the created workflow, right-click the workflow name, and then select Delete.

    3. In the left navigation pane of the Data Studio page, click the image icon and select Computing Resources. On the Computing Resources tab, find the MaxCompute computing resource associated with the workspace, and click Disassociate. In the dialog box that appears, select the risk confirmation check box and disassociate the computing resource from the workspace as instructed.

  3. Delete the MaxCompute project.

    Go to the Projects page in the MaxCompute console, find the created MaxCompute project, click Delete in the Actions column, and then delete the project as instructed. If you want to remove only the tutorial tables and keep the project, see the SQL sketch after this list.

  4. Delete the DataWorks workspace.

    1. Log on to the DataWorks console. In the top navigation bar, select the region in which the workspace to delete resides. In the left-side navigation pane of the DataWorks console, click Workspace. On the Workspaces page, find the workspace that you want to delete, click the image icon in the Actions column, and then select Delete Workspace.

    2. In the Delete Workspace dialog box, select the check boxes for deletion confirmation, and click OK.
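
If you want to keep the MaxCompute project and remove only the tables that were created for this tutorial, you can instead run the following statements in an SQL query file against the project. This is a sketch; dropping a table is irreversible:

    DROP TABLE IF EXISTS bank_data;
    DROP TABLE IF EXISTS result_table;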

References

For information about more tutorials, see Tutorials for different business scenarios.