This topic guides you through a house buying group analysis to help you quickly understand the data development and data analysis processes in DataWorks.
Introduction
This tutorial analyzes the house buying situations of different user groups based on house buying data. In this tutorial, on-premises data is uploaded to a MaxCompute table named bank_data by using DataWorks, and MaxCompute SQL nodes are used to analyze user groups and obtain a table named result_table. Finally, simple visualization and analysis are performed based on result_table to obtain user group profile data.
This tutorial demonstrates DataWorks features based on simulated data. In actual business scenarios, you must make adjustments based on your business data.
The following figure shows the data forwarding path and data development process in this tutorial.
After data analysis is complete, you can obtain the following group analysis profile from the house buying data: The education levels of single people who have mortgages are mainly university.degree and high.school.
Preparations
Activate DataWorks
Create a workspace
Create a resource group and associate it with the workspace
Create a MaxCompute project and associate the project with the workspace as a computing resource
Procedure
In this tutorial, you need to use DataWorks to upload the test data provided for this tutorial to a MaxCompute project and create a workflow in Data Studio to cleanse and write the test data. In addition, you need to debug and run the workflow and execute SQL statements to query and verify the running result.
Step 1: Create a MaxCompute table
Before you upload the test data, you need to use the data catalog feature of DataWorks to create a table named bank_data in the MaxCompute project to store the test data to be uploaded.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane of the DataWorks console, choose . On the page that appears, select the desired workspace from the Select Workspace drop-down list and click Go to Data Studio.
In the left-side navigation pane of the Data Studio page, click the icon to go to the DATA CATALOG pane.
(Optional) If your MaxCompute project is not added to DataWorks as a data catalog, click the icon next to MaxCompute to go to the DataWorks Data Sources tab. On the DataWorks Data Sources tab, add the MaxCompute project that is already added to the workspace as a computing resource or data source as a MaxCompute data catalog.
In the DATA CATALOG pane, click MaxCompute. In the MaxCompute section, create a MaxCompute table in the Table folder under the MaxCompute project that is added as a data catalog.
Note: If the schema feature is enabled for the MaxCompute project, you must open the schema in the MaxCompute project before you can create a MaxCompute table in the Table folder under the MaxCompute project.
In this tutorial, a workspace in standard mode is used, and debugging needs to be performed only in the development environment. Therefore, you need to create the bank_data table in the MaxCompute project only in the development environment. If you use a workspace in basic mode, you need to create the bank_data table in the MaxCompute project only in the production environment.
Click the icon next to Table to go to the table creation tab.
In the DDL section of the table creation tab, enter the following SQL code. Then, the system automatically generates all information for the table.
CREATE TABLE IF NOT EXISTS bank_data
(
    age            BIGINT COMMENT 'The age',
    job            STRING COMMENT 'The job type',
    marital        STRING COMMENT 'The marital status',
    education      STRING COMMENT 'The education level',
    default        STRING COMMENT 'Whether a credit card is possessed',
    housing        STRING COMMENT 'The mortgage',
    loan           STRING COMMENT 'The loan',
    contact        STRING COMMENT 'The contact information',
    month          STRING COMMENT 'The month',
    day_of_week    STRING COMMENT 'The day of the week',
    duration       STRING COMMENT 'The duration',
    campaign       BIGINT COMMENT 'The number of contacts during the campaign',
    pdays          DOUBLE COMMENT 'The time elapsed since the last contact',
    previous       DOUBLE COMMENT 'The number of contacts with the customer',
    poutcome       STRING COMMENT 'The result of the previous marketing campaign',
    emp_var_rate   DOUBLE COMMENT 'The employment change rate',
    cons_price_idx DOUBLE COMMENT 'The consumer price index',
    cons_conf_idx  DOUBLE COMMENT 'The consumer confidence index',
    euribor3m      DOUBLE COMMENT 'The euro deposit rate',
    nr_employed    DOUBLE COMMENT 'The number of employees',
    y              BIGINT COMMENT 'Whether a time deposit is possessed'
);
In the top toolbar of the configuration tab, click Deploy to create the bank_data table in the MaxCompute project in the development environment.
After the bank_data table is created, click the table name in the MaxCompute section to view the detailed information of the table.
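If you want to confirm the table schema with a statement instead of the UI, you can run a DESC command in a MaxCompute SQL editor. This is an optional check, not a tutorial step:
DESC bank_data;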
Step 2: Upload data to the bank_data table
Download the banking.csv file to your on-premises machine, and then upload the data in the file to the bank_data table created in the MaxCompute project.
Before you upload the data in the file, make sure that you specify a resource group for scheduling and a resource group for Data Integration for the data upload. For more information, see Limits.
In the upper-left corner of the DataWorks console, click the icon and choose .
In the Recent Upload Records section of the Upload and Download page, click Upload Data. On the Upload Data page, configure the parameters as follows:

Data Source: Select Local File.

Specify Data to Be Uploaded:
- Select File: Upload the banking.csv file that is downloaded to your on-premises machine.

Configure Destination Table:
- Compute Engine: Select MaxCompute.
- MaxCompute Project Name: Select the MaxCompute project to which the bank_data table belongs.
- Select Destination Table: Select bank_data.

Preview Data of Uploaded File:
- Click Mapping by Order to map the data in the file to the fields in the bank_data table.

Note:
- You can upload files whose names are suffixed with .csv, .xls, .xlsx, or .json from your on-premises machine.
- If you upload a file whose name is suffixed with .xls or .xlsx, the first sheet of the file is uploaded by default.
- If you upload a file whose name is suffixed with .csv, the size of the file cannot exceed 5 GB. For other supported file types, the file size cannot exceed 100 MB.
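As an optional alternative to the upload wizard, the MaxCompute client can load the same file with its tunnel command. The following is a sketch, assuming the client is configured against the MaxCompute project in the development environment; the -h true flag indicates that banking.csv contains a header row:
tunnel upload banking.csv bank_data -h true;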
Click Upload Data to upload the data in the CSV file to the bank_data table.
Check whether the data is written to the bank_data table.
After the data is uploaded, you can use the SQL query feature to check whether the data is written to the bank_data table.
Click the icon in the upper-left corner of the DataWorks console and choose .
In the SQL Query pane, click the icon next to My Files and select Create File. In the Create File dialog box, configure the File Name parameter based on your business requirements and click OK.
On the configuration tab of the file, enter the following SQL statement:
SELECT * FROM bank_data LIMIT 10;
In the upper-right corner of the configuration tab of the SQL query file, select the workspace to which the bank_data table belongs and the desired MaxCompute data source, and click OK.
Note: In this tutorial, a workspace in standard mode is used, and the bank_data table is created only in the development environment. Therefore, you must select a MaxCompute data source in the development environment. If you use a workspace in basic mode, you can select a MaxCompute data source in the production environment.
In the top toolbar of the configuration tab, click the Run icon. In the Estimate Costs dialog box, click Run. After the SQL statement is executed, the first ten data records in the bank_data table are displayed in the lower part of the configuration tab. This indicates that the data in the file is successfully uploaded from your on-premises machine to the bank_data table.
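If you want a fuller check than the ten-row preview, you can also count the rows that were written. This is an optional sanity check; the expected total depends on the banking.csv file that you downloaded:
SELECT COUNT(*) FROM bank_data;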
Step 3: Process data
In this step, you need to use MaxCompute SQL nodes to filter the data that is uploaded to the bank_data table, obtain the data about the distribution of education levels among single people who have mortgages, and write the processed data to the result_table table.
Build a data processing link
In the upper-left corner of the DataWorks console, click the icon and choose .
In the top navigation bar of the Data Studio page, switch to the workspace created for this tutorial. In the left-side navigation pane of the Data Studio page, click the icon.
In the Workspace Directories section of the DATA STUDIO pane, click the icon and select Create Workflow. In the Create Workflow dialog box, enter a name in the Name field and click OK to save the workflow. In this tutorial, the Name parameter is set to dw_basic_case.
On the configuration tab of the workflow, drag Zero Load Node and MaxCompute SQL from the left-side section to the canvas on the right, and specify names for the nodes.
The following list describes the node names that are used in this tutorial and the functionalities of the nodes:
- Zero load node (workshop_start): This node is used to manage all nodes involved in this tutorial and helps clarify the data forwarding path. You do not need to write code for this node.
- MaxCompute SQL (ddl_result_table): This node is used to create a table named result_table. This table stores the cleansed data from the bank_data table.
- MaxCompute SQL (insert_result_table): This node is used to filter data in the bank_data table and write the filtered data to the result_table table.
Draw lines to configure dependencies between the nodes, as shown in the following figure.
Note: You can draw lines to configure scheduling dependencies for nodes in a workflow. You can also use the automatic parsing feature to enable the system to automatically identify scheduling dependencies between nodes. In this tutorial, scheduling dependencies between nodes are configured by drawing lines. For information about the automatic parsing feature, see Use the automatic parsing feature.
In the top toolbar of the configuration tab, click Save.
Configure the data processing nodes
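The tutorial does not reproduce the node code here. The following is a minimal sketch of what the two MaxCompute SQL nodes could contain, assuming that result_table stores each education level together with a headcount; the result_table schema and the filter values (marital = 'single', housing = 'yes') are assumptions inferred from the analysis goal, not code confirmed by this tutorial.

Code sketch for the ddl_result_table node:
CREATE TABLE IF NOT EXISTS result_table
(
    education STRING COMMENT 'The education level',
    num       BIGINT COMMENT 'The number of people'
);

Code sketch for the insert_result_table node:
-- Count single customers who have mortgages, grouped by education level.
INSERT OVERWRITE TABLE result_table
SELECT education,
       COUNT(*) AS num
FROM   bank_data
WHERE  marital = 'single'
AND    housing = 'yes'
GROUP BY education;

After both nodes run, result_table holds one row per education level, which is the data queried and visualized in Step 5.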
Step 4: Debug and run the workflow
After the dw_basic_case workflow is configured, you can debug and run the workflow on the Data Studio page to check whether the configurations of the workflow are correct. To debug and run the workflow, you must go to the configuration tab of the workflow again.
After you complete the development of the inner nodes of the workflow, switch to the configuration tab of the workflow and click the Save icon in the top toolbar of the configuration tab to save the workflow.
After the workflow is saved, click the Run icon in the top toolbar of the configuration tab to debug and run the workflow.
After the workflow finishes running, view the result. The following figure shows the expected result.
Step 5: Query and display data
After the data is uploaded to the MaxCompute computing resource and processed in Data Studio, you can query the data and view the data analysis result.
Click the icon in the upper-left corner of the DataWorks console and choose .
In the SQL Query pane, click the icon next to My Files and select Create File. In the Create File dialog box, configure the File Name parameter based on your business requirements and click OK.
On the configuration tab of the file, enter the following SQL statement:
SELECT * FROM result_table;
In the upper-right corner of the configuration tab of the SQL query file, select the workspace to which the result_table table belongs and the desired MaxCompute data source, and click OK.
Note: In this tutorial, a workspace in standard mode is used, and the result_table table is created only in the development environment and is not deployed to the production environment. Therefore, you must select a MaxCompute data source in the development environment. If you use a workspace in basic mode, you can select a MaxCompute data source in the production environment.
In the top toolbar of the configuration tab, click the Run icon. In the Estimate Costs dialog box, click Run.
On the tab that displays the execution result, click the icon to view the visualization result. You can click the icon in the upper-right corner of the tab to customize the graph pattern.
In the upper-right corner of the tab that displays the execution result, click the Save icon to save the chart as a card. In the left-side navigation pane of the SQL Query page, click the (Cards) icon to view the saved card.
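If you want the chart to list education levels by headcount, you can run a sorted variant of the query instead. This assumes the result_table schema sketched in Step 3 (education and num columns); note that MaxCompute requires a LIMIT clause with ORDER BY by default, hence the LIMIT 10:
SELECT education, num
FROM   result_table
ORDER BY num DESC
LIMIT  10;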
Additional information
For detailed information about the operations in the modules involved in this tutorial and for parameter descriptions, see the topics in the Data Studio (new version) and Data Analysis directories.
In addition to the modules involved in this tutorial, DataWorks also supports other modules and features, such as Data Modeling, Data Quality, Data Security Guard, DataService Studio, Data Integration, and node scheduling configuration, to provide end-to-end data monitoring and O&M services. For more information, see Data Modeling, Data Quality, Data Security Guard, DataService Studio, Data Integration, and Node scheduling configuration.
You can also experience more DataWorks tutorials. For more information, see Tutorials for different business scenarios.
Appendix: Resource release and deletion
References
For information about how to build a data warehouse based on AnalyticDB, see Build a data warehouse for an enterprise based on AnalyticDB for MySQL.
For information about more tutorials, see Tutorials for different business scenarios.