This tutorial walks you through a house-buying group analysis, helping you quickly grasp the data development and data analysis processes in DataWorks.
Introduction
This tutorial analyzes house buying situations of different groups based on house buying data. In this tutorial, on-premises data is uploaded to a MaxCompute table named bank_data by using DataWorks, and MaxCompute SQL nodes are used to analyze user groups to obtain a table named result_table. Finally, simple visualization and analysis are performed based on result_table to obtain user group profile data.
This tutorial demonstrates DataWorks features based on simulated data. In actual business scenarios, make adjustments based on your business data.
The following figure shows the data forwarding path and data development process in this tutorial.
After data analysis is complete, you can obtain the following group analysis profile from the house buying data: The education levels of single people who have mortgages are mainly university.degree and high.school.

Preparations
Activate DataWorks
Create a workspace
Create a resource group and associate it with a workspace
Create a MaxCompute project and associate the project with the workspace as a computing resource
Procedure
In this tutorial, you will use DataWorks to upload the provided test data to a MaxCompute project and create a workflow in Data Studio to cleanse and write the data. You will also debug and run the workflow, and execute SQL statements to query and verify the results.
Step 1: Create a MaxCompute table
Before uploading the test data, you need to use the Data Catalog feature in DataWorks to create a table named bank_data in the MaxCompute project to store the incoming data.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane of the DataWorks console, choose . On the page that appears, select the desired workspace from the Select Workspace drop-down list and click Go to Data Studio.
In the left-side navigation pane of the Data Studio page, click the
icon to go to the Data Catalog pane.(Optional) If your MaxCompute project is not listed in the Data Catalog, click the
icon next to MaxCompute to go to the DataWorks Data Sources tab. There, add your MaxCompute project (which has already been configured as a computing resource or data source) to the workspace as a data catalog.In the Data Catalog pane, click MaxCompute. Under your MaxCompute project, navigate to the Table. Click the
icon to create a new table.NoteIf the schema feature is enabled for the MaxCompute project, open the schema in the MaxCompute project before creating a MaxCompute table in the Table folder under the MaxCompute project.
In this tutorial, a workspace in Standard Mode is used, and debugging is performed only in the development environment. Therefore, you only need to create the
bank_datatable in the development environment of the MaxCompute project. If you use a workspace in Basic Mode, you need to create thebank_datatable in the production environment instead.
In the DDL section of the table creation tab, enter the following SQL code. Then, the system automatically generates all information for the table.
CREATE TABLE IF NOT EXISTS bank_data ( age BIGINT COMMENT 'The age', job STRING COMMENT 'The job type', marital STRING COMMENT 'The marital status', education STRING COMMENT 'The education level', default STRING COMMENT 'Whether a credit card is possessed', housing STRING COMMENT 'The mortgage', loan STRING COMMENT 'The loan', contact STRING COMMENT 'The contact information', month STRING COMMENT 'The month', day_of_week STRING COMMENT 'The day of the week', duration STRING COMMENT 'The duration', campaign BIGINT COMMENT 'The number of contacts during the campaign', pdays DOUBLE COMMENT 'The time elapsed since the last contact', previous DOUBLE COMMENT 'The number of contacts with the customer', poutcome STRING COMMENT 'The result of the previous marketing campaign', emp_var_rate DOUBLE COMMENT 'The employment change rate', cons_price_idx DOUBLE COMMENT 'The consumer price index', cons_conf_idx DOUBLE COMMENT 'The consumer confidence index', euribor3m DOUBLE COMMENT 'The euro deposit rate', nr_employed DOUBLE COMMENT 'The number of employees', y BIGINT COMMENT 'Whether a time deposit is possessed' );In the top toolbar of the configuration tab, click Deploy to create the
bank_datatable in the MaxCompute project in the development environment.After the
bank_datatable is created, click the table name in the MaxCompute section to view the detailed information of the table.
Step 2: Upload data to the bank_data table
Download the banking.csv file to your on-premises machine and then upload data in the file to the bank_data table created in the MaxCompute project. For more information, see Upload data.
Before uploading data in the file, make sure that you specify a resource group for scheduling and a resource group for Data Integration for data upload. For more information, see Limitations.
In the upper-left corner of the DataWorks console, click the
icon and choose .In the Recent Upload Records section, click Upload Data. On the Upload Data page, configure the parameters by referring to the following table.
Section
Description
Data Source
Local file.
Specify Data to Be Uploaded
Select File
Upload the
banking.csvfile that is downloaded to your on-premises machine.Configure Destination Table
Compute Engine
MaxCompute.
MaxComputeProject Name
Select the MaxCompute project to which the
bank_datatable belongs.Select Destination Table
Select
bank_data. Ifbank_datais not displayed, refer to the FAQ to refresh the table metadata.Preview Data of Uploaded File
Click Mapping by Order to complete mapping between the data in the file to the fields in the
bank_datatable.NoteUpload files whose names are suffixed with
.csv,.xls,.xlsx, and.jsonfrom your on-premises machine.If you upload a file whose name is suffixed with .xls or .xlsx, the first sheet of the file is uploaded by default.
If you upload a file whose name is suffixed with
.csv, the size of the file cannot exceed 5 GB. For other types of files, the file size cannot exceed 100 MB.Retain default values for the parameters that are not described in the table.
Click Upload Data to upload data in the CSV file to the
bank_datatable.Check whether the data is written to the
bank_datatable.After the data is uploaded, you can use the SQL query (legacy) feature to check whether the data is written to the
bank_datatable.Click the
icon in the upper-left corner. In the pop-up page, click All Products > DataAnalysis > SQL Query.After My Files, click . Specify a custom File Name and click Determine.
On the SQL query page, enter the following SQL statement.
SELECT * FROM bank_data limit 10;In the upper-right corner of the configuration tab of the SQL query file, select the workspace to which the
bank_datatable belongs and the desired MaxCompute data source, and click OK.NoteIn this tutorial, a workspace in Standard Mode is used, and the
bank_datatable is created only in the development environment. Therefore, select a MaxCompute data source in the development environment. If you use a workspace in basic mode, you can select a MaxCompute data source in the production environment.In the top toolbar of the configuration tab, click the Run icon. In the Estimate Costs dialog box, click Run. After the SQL statement is executed, you can view the first ten data records in the
bank_datatable that are displayed in the lower part of the configuration tab. This indicates that the data in the file is successfully uploaded from your on-premises machine to thebank_datatable.
Step 3: Process data
In this step, you need to use MaxCompute SQL nodes to filter data that is uploaded to the bank_data table to obtain the data about the distribution of education levels among single people who have mortgages, and write the processed data to the result_table table.
Build a data processing link
In the upper-left corner of the DataWorks console, click the
icon and choose .In the top navigation bar of the Data Studio page, switch to the workspace created for this tutorial. In the left-side navigation pane of the Data Studio page, click the
icon.In the Workspace Directories section of the DATA STUDIO pane, click the
icon and select Create Workflow. In the Create Workflow dialog box, enter a name in the Name field and click OK to save the workflow. In this tutorial, the Name parameter is set to dw_basic_case.On the configuration tab of the workflow, drag Zero Load Node and MaxCompute SQL from the left-side section to the canvas on the right, and specify names for the nodes. The following table lists the node names that are used in this tutorial and the functionalities of the nodes.
Node type
Node name
Node functionality
Zero load nodeworkshop_startThis node is a zero load node used to manage all nodes involved in this tutorial. This node helps clarify the data forwarding path. You do not need to write code for this node.
MaxCompute SQLddl_result_tableThis node is used to create a table named
result_table. This table is used to store the cleansed data in the bank_data table
MaxCompute SQLinsert_result_tableThis node is used to filter data in the
bank_datatable and write the filtered data to theresult_tabletable.Draw lines to configure dependencies between the nodes, as shown in the following figure.
NoteYou can draw lines to configure scheduling dependencies for nodes in a workflow. You can also use the automatic parsing feature to enable the system to automatically identify scheduling dependencies between nodes.
In this tutorial, scheduling dependencies between nodes are configured by drawing lines. For information about the automatic parsing feature, see Use the automatic parsing feature.
In the top toolbar of the configuration tab, click Save.
Configure the data processing nodes
Step 4: Debug and run the workflow
After the dw_basic_case workflow is configured, you can debug and run the workflow on the Data Studio page to check whether the configurations of the workflow are correct. To debug and run the workflow, go to the configuration tab of the workflow again.
After you complete the development of the inner nodes of the workflow, switch to the configuration tab of the workflow and click the Save icon in the top toolbar of the configuration tab to save the workflow.
After the workflow is saved, click the Run icon in the top toolbar of the configuration tab to debug and run the workflow.
After the running of the workflow is complete, view the result. The following figure shows the expected result.

Step 5: Query and display data
After the data is uploaded to the MaxCompute computing resource and processed in Data Studio, you can query the data and view the data analysis result.
Click the
icon in the upper-left corner. In the pop-up page, click All Products > DataAnalysis > SQL Query.In the SQL Query pane, click the
icon next to My Files and select Create File. In the Create File dialog box, configure the File Name parameter based on your business requirements and click OK.On the SQL query page, enter the following SQL statement.
SELECT * FROM result_table;In the upper-right corner of the configuration tab of the SQL query file, select the workspace to which the
result_tabletable belongs and the desired MaxCompute data source, and click OK.NoteIn this tutorial, a workspace in Standard Mode is used, and the
result_tabletable is created only in the development environment and is not deployed to the production environment. Therefore, select a MaxCompute data source in the development environment. If you use a workspace in basic mode, you can select a MaxCompute data source in the production environment.In the top toolbar of the configuration tab, click the Run icon. In the Estimate Costs dialog box, click Run.
On the tab that displays the execution result, click the
icon to view the visualization result. You can click the
icon in the upper-right corner of the tab to customize the graph pattern.In the upper-right corner of the tab that displays the execution result, click the Save icon to save the chart as a card. In the left-side navigation pane of the SQL Query page, click the
(Cards) icon to view the saved card.
What to do next
For detailed information about the operations performed in the modules used in this tutorial, as well as parameter descriptions, see the topics in the Data Studio (new version) and Data Analysis.
In addition to the modules covered in this tutorial, DataWorks also supports other modules and features, such as Data Modeling, Data Quality, Data Security Guard, DataService Studio, Data Integration, and node scheduling configuration, to provide end-to-end data monitoring and O&M services. For more information, see Data Modeling, Data quality, Data Security Guard, DataService Studio, Data Integration, and Node scheduling configuration.
You can also experience more DataWorks tutorials. For more information, see Tutorials for different business scenarios.
Appendix: Resource release and deletion
FAQ
The target table bank_data is not displayed when uploading data
Go to . The Guid should follow the format "project.table" or "odps.project.table", where:
project is the project name. You can find it by clicking on
bank_datain the Data Catalog in Step 1, and checking the Basic Information panel on the right.table is the table name, which is
bank_data.