All Products
Search
Document Center

DataWorks:Simple case: Analysis of house buying groups

Last Updated:Jun 17, 2026

This tutorial walks you through a house-buying group analysis, helping you quickly grasp the data development and data analysis processes in DataWorks.

Introduction

This tutorial analyzes house buying situations of different groups based on house buying data. In this tutorial, on-premises data is uploaded to a MaxCompute table named bank_data by using DataWorks, and MaxCompute SQL nodes are used to analyze user groups to obtain a table named result_table. Finally, simple visualization and analysis are performed based on result_table to obtain user group profile data.

Note

This tutorial demonstrates DataWorks features based on simulated data. In actual business scenarios, make adjustments based on your business data.

The following figure shows the data forwarding path and data development process in this tutorial.

image

After data analysis is complete, you can obtain the following group analysis profile from the house buying data: The education levels of single people who have mortgages are mainly university.degree and high.school.

image

Preparations

Activate DataWorks

This tutorial uses the China (Shanghai) region as an example. Log on to the DataWorks console, switch to the China (Shanghai) region, and check whether DataWorks is activated in this region.

Note

This tutorial uses China (Shanghai) as an example. In practice, select the region based on the actual location of your business data:

  • If your business data is stored in other Alibaba Cloud services, select the same region as those services.

  • If your business is on-premises and requires public network access, select a region that is geographically close to your location to reduce latency.

New users

If you are a new user using DataWorks for the first time, the page will indicate that DataWorks has not been activated in the current region. Click Purchase Product Portfolio for Free.

  1. Configure the parameters on the combined purchase page.

    Parameter

    Description

    Example

    Region

    Select the region in which you want to activate DataWorks.

    China (Shanghai)

    DataWorks Edition

    Select the DataWorks edition to purchase.

    Note

    This tutorial uses Basic Edition as an example. All editions support the features described in this tutorial. You can refer to Edition comparison and select a DataWorks edition based on your business requirements.

    Basic Edition

  2. Click Confirm Order and Pay and complete the payment.

Previously activated but expired

If you previously activated DataWorks in the China (Shanghai) region but the DataWorks edition has expired, the following prompt appears. Click Purchase Edition.

The page shows the message You must have both a DataWorks edition and pay-as-you-go resources. A red cross icon indicates that no DataWorks edition is currently active in this region, and a green check icon indicates that pay-as-you-go resources are already available in this region.

  1. Configure the parameters on the purchase page.

    Parameter

    Description

    Example

    Version

    Select the DataWorks edition to purchase.

    Note

    This tutorial uses Basic Edition as an example. All editions support the features described in this tutorial. You can refer to Edition comparison and select a DataWorks edition based on your business requirements.

    Basic Edition

    Region

    Select the region in which you want to activate DataWorks.

    China (Shanghai)

  2. Click Buy Now and complete the payment.

Important

After you purchase a DataWorks edition, if you cannot find the edition, try the following:

  • Wait a few minutes and refresh the page. The system update may be delayed.

  • Check whether the current region matches the region in which you purchased the DataWorks edition. Region mismatches may cause the edition to be invisible.

Already activated

If you have already activated DataWorks in the China (Shanghai) region, the DataWorks overview page appears and you can proceed to the next step.

Create a workspace

  1. Go to the Workspaces page, switch to the China (Shanghai) region, and then click Create Workspace.

  2. On the Create Workspace page, configure the Workspace Name parameter based on your business requirements, turn on Isolate Development and Production Environments and Participate in Public Preview of Data Studio, and then click Create Workspace.

    Note
    • In this tutorial, a workspace in Standard Mode is used with Isolate Development and Production Environments enabled. For more details, see Workspace mode differences.

    • After February 18, 2025, if you use an Alibaba Cloud account to activate DataWorks in the China (Shanghai) region for the first time and create a workspace, the Participate in Public Preview of Data Studio parameter is not displayed and new-version Data Studio is activated by default.

Create a resource group and associate it with a workspace

  1. Go to the DataWorks resource group list page, switch to the China (Shanghai) region, and click Create Resource Group.

  2. On the resource group purchase page, configure the following parameters.

    Parameter

    Description

    Resource Group Name

    Custom.

    VPC, vSwitch

    Select an existing VPC and vSwitch. If none exists in the current region, click the corresponding console link in the parameter description to create one.

    Service-linked Role

    Follow the on-screen instructions to create the service-linked role.

  3. Click Buy Now and complete the payment.

  4. Go to the DataWorks resource group list page, switch to the China (Shanghai) region, find the resource group you created, and click Associate Workspace in the Operation column.

  5. On the Associate Workspace page, find the workspace you created and click Bind in the Operation column.

Create a MaxCompute project and associate the project with the workspace as a computing resource

This tutorial requires you to create a MaxCompute project and associate it with the workspace as a computing resource for receiving data and performing big data analysis in subsequent operations.

  1. Go to the Workspaces page in the DataWorks console, switch to the China (Shanghai) region, find the created workspace, and then click the workspace name to go to the Workspace Details page.

  2. In the left navigation pane, click Computing Resource. On the Computing Resource page, click Associate Computing Resource. In the Associate Computing Resource panel, click MaxCompute. In the Associate MaxCompute Computing Resource panel, configure the following key parameters to create a MaxCompute project and associate it with the workspace as a computing resource.

    Note

    You can retain default values for the parameters that are not described in the following table.

    Parameter

    Description

    MaxCompute Project

    Click the MaxCompute Project drop-down list and then click Create. In the Create Project dialog box, configure the following parameters:

    • Project Name: Specify a name based on your business requirements. It is recommended to follow the MaxCompute naming conventions:

      • Development Environment: project_name_dev.

      • Production Environment: project_name.

      • Make sure that the specified name is globally unique.

    • Billing Method: Select Pay-as-you-go.

      Note

      If pay-as-you-go is unavailable, click Activate next to Pay-as-you-go to activate the MaxCompute service.

    • Default Quota: Select an existing quota from the drop-down list.

    Default Access Identity

    In the Production Environment pane, select Alibaba Cloud Account.

    Computing Resource Instance Name

    Specify a name for the computing resource. When you run tasks in subsequent operations, you can select a computing resource instance name to specify the computing resource that you want to use. In this tutorial, the computing resource is named MaxCompute_Source.

  3. Click OK.

Procedure

In this tutorial, you will use DataWorks to upload the provided test data to a MaxCompute project and create a workflow in Data Studio to cleanse and write the data. You will also debug and run the workflow, and execute SQL statements to query and verify the results.

Step 1: Create a MaxCompute table

Before uploading the test data, you need to use the Data Catalog feature in DataWorks to create a table named bank_data in the MaxCompute project to store the incoming data.

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane of the DataWorks console, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the Select Workspace drop-down list and click Go to Data Studio.

  2. In the left-side navigation pane of the Data Studio page, click the image icon to go to the Data Catalog pane.

  3. (Optional) If your MaxCompute project is not listed in the Data Catalog, click the image icon next to MaxCompute to go to the DataWorks Data Sources tab. There, add your MaxCompute project (which has already been configured as a computing resource or data source) to the workspace as a data catalog.

  4. In the Data Catalog pane, click MaxCompute. Under your MaxCompute project, navigate to the Table. Click the image icon to create a new table.

    Note
    • If the schema feature is enabled for the MaxCompute project, open the schema in the MaxCompute project before creating a MaxCompute table in the Table folder under the MaxCompute project.

    • In this tutorial, a workspace in Standard Mode is used, and debugging is performed only in the development environment. Therefore, you only need to create the bank_data table in the development environment of the MaxCompute project. If you use a workspace in Basic Mode, you need to create the bank_data table in the production environment instead.

  5. In the DDL section of the table creation tab, enter the following SQL code. Then, the system automatically generates all information for the table.

    CREATE TABLE IF NOT EXISTS bank_data
    (
     age             BIGINT COMMENT 'The age',
     job             STRING COMMENT 'The job type',
     marital         STRING COMMENT 'The marital status',
     education       STRING COMMENT 'The education level',
     default         STRING COMMENT 'Whether a credit card is possessed',
     housing         STRING COMMENT 'The mortgage',
     loan            STRING COMMENT 'The loan',
     contact         STRING COMMENT 'The contact information',
     month           STRING COMMENT 'The month',
     day_of_week     STRING COMMENT 'The day of the week',
     duration        STRING COMMENT 'The duration',
     campaign        BIGINT COMMENT 'The number of contacts during the campaign',
     pdays           DOUBLE COMMENT 'The time elapsed since the last contact',
     previous        DOUBLE COMMENT 'The number of contacts with the customer',
     poutcome        STRING COMMENT 'The result of the previous marketing campaign',
     emp_var_rate    DOUBLE COMMENT 'The employment change rate',
     cons_price_idx  DOUBLE COMMENT 'The consumer price index',
     cons_conf_idx   DOUBLE COMMENT 'The consumer confidence index',
     euribor3m       DOUBLE COMMENT 'The euro deposit rate',
     nr_employed     DOUBLE COMMENT 'The number of employees',
     y               BIGINT COMMENT 'Whether a time deposit is possessed'
    );
  6. In the top toolbar of the configuration tab, click Deploy to create the bank_data table in the MaxCompute project in the development environment.

  7. After the bank_data table is created, click the table name in the MaxCompute section to view the detailed information of the table.

Step 2: Upload data to the bank_data table

Download the banking.csv file to your on-premises machine and then upload data in the file to the bank_data table created in the MaxCompute project. For more information, see Upload data.

Important

Before uploading data in the file, make sure that you specify a resource group for scheduling and a resource group for Data Integration for data upload. For more information, see Limitations.

  1. In the upper-left corner of the DataWorks console, click the image icon and choose All Products > Data Integration > Upload and Download.

  2. In the Recent Upload Records section, click Upload Data. On the Upload Data page, configure the parameters by referring to the following table.

    Section

    Description

    Data Source

    Local file.

    Specify Data to Be Uploaded

    Select File

    Upload the banking.csv file that is downloaded to your on-premises machine.

    Configure Destination Table

    Compute Engine

    MaxCompute.

    MaxComputeProject Name

    Select the MaxCompute project to which the bank_data table belongs.

    Select Destination Table

    Select bank_data. If bank_data is not displayed, refer to the FAQ to refresh the table metadata.

    Preview Data of Uploaded File

    Click Mapping by Order to complete mapping between the data in the file to the fields in the bank_data table.

    Note
    • Upload files whose names are suffixed with .csv, .xls, .xlsx, and .json from your on-premises machine.

    • If you upload a file whose name is suffixed with .xls or .xlsx, the first sheet of the file is uploaded by default.

    • If you upload a file whose name is suffixed with .csv, the size of the file cannot exceed 5 GB. For other types of files, the file size cannot exceed 100 MB.

    • Retain default values for the parameters that are not described in the table.

  3. Click Upload Data to upload data in the CSV file to the bank_data table.

  4. Check whether the data is written to the bank_data table.

    After the data is uploaded, you can use the SQL query (legacy) feature to check whether the data is written to the bank_data table.

    1. Click the image icon in the upper-left corner. In the pop-up page, click All Products > DataAnalysis > SQL Query.

    2. After My Files, click image > Create File. Specify a custom File Name and click Determine.

    3. On the SQL query page, enter the following SQL statement.

      SELECT * FROM bank_data limit 10;
    4. In the upper-right corner of the configuration tab of the SQL query file, select the workspace to which the bank_data table belongs and the desired MaxCompute data source, and click OK.

      Note

      In this tutorial, a workspace in Standard Mode is used, and the bank_data table is created only in the development environment. Therefore, select a MaxCompute data source in the development environment. If you use a workspace in basic mode, you can select a MaxCompute data source in the production environment.

    5. In the top toolbar of the configuration tab, click the Run icon. In the Estimate Costs dialog box, click Run. After the SQL statement is executed, you can view the first ten data records in the bank_data table that are displayed in the lower part of the configuration tab. This indicates that the data in the file is successfully uploaded from your on-premises machine to thebank_data table.

      image

Step 3: Process data

In this step, you need to use MaxCompute SQL nodes to filter data that is uploaded to the bank_data table to obtain the data about the distribution of education levels among single people who have mortgages, and write the processed data to the result_table table.

Build a data processing link

  1. In the upper-left corner of the DataWorks console, click the 图标 icon and choose All Products > Data Development And Task Operation > DataStudio.

  2. In the top navigation bar of the Data Studio page, switch to the workspace created for this tutorial. In the left-side navigation pane of the Data Studio page, click the image icon.

  3. In the Workspace Directories section of the DATA STUDIO pane, click the image icon and select Create Workflow. In the Create Workflow dialog box, enter a name in the Name field and click OK to save the workflow. In this tutorial, the Name parameter is set to dw_basic_case.

  4. On the configuration tab of the workflow, drag Zero Load Node and MaxCompute SQL from the left-side section to the canvas on the right, and specify names for the nodes. The following table lists the node names that are used in this tutorial and the functionalities of the nodes.

    Node type

    Node name

    Node functionality

    image Zero load node

    workshop_start

    This node is a zero load node used to manage all nodes involved in this tutorial. This node helps clarify the data forwarding path. You do not need to write code for this node.

    image MaxCompute SQL

    ddl_result_table

    This node is used to create a table named result_table. This table is used to store the cleansed data in the bank_data table

    image MaxCompute SQL

    insert_result_table

    This node is used to filter data in the bank_data table and write the filtered data to the result_table table.

  5. Draw lines to configure dependencies between the nodes, as shown in the following figure.

    image

    Note

    You can draw lines to configure scheduling dependencies for nodes in a workflow. You can also use the automatic parsing feature to enable the system to automatically identify scheduling dependencies between nodes.

    In this tutorial, scheduling dependencies between nodes are configured by drawing lines. For information about the automatic parsing feature, see Use the automatic parsing feature.

  6. In the top toolbar of the configuration tab, click Save.

Configure the data processing nodes

Configure the ddl_result_table node

This node is used to create the result_table table. The table is used to store the data about the distribution of education levels among single people who have mortgages. You can use the insert_result_table node to write the data in the table.

  1. In the canvas of the workflow, move the pointer over the ddl_result_table node and click Open Node. This node provides the result_table table for the insert_result_table node to write the distribution of education levels among single people who have mortgages.

  2. Copy the following code and paste it into the code editor:

    CREATE TABLE IF NOT EXISTS result_table
    (  
     education   STRING COMMENT 'The education level',
     num         BIGINT COMMENT 'The number of persons'
    );
  3. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the MaxCompute SQL node, click Debugging Configurations. On the Debugging Configurations tab, make the following configurations:

    1. Select the MaxCompute computing resource that is associated with the workspace when you make preparations from the Computing Resource drop-down list, and select the desired quota from the Computing Quota drop-down list.

    2. Select the serverless resource group that you purchase when you make preparations from the Resource Group drop-down list.

  4. In the top toolbar of the configuration tab, click Save.

Configure the insert_result_table node

This node is used to process and filter data in the bank_data table to obtain the data about the distribution of education levels among single people who have mortgages and write the data to the result_table table for subsequent data analysis and chart display.

  1. In the canvas of the workflow, move the pointer over the insert_result_table node and click Open Node.

  2. Copy the following code and paste it in the code editor:

    INSERT OVERWRITE TABLE result_table  -- Insert data into result_table.
    SELECT education
        , COUNT(marital) AS num
    FROM bank_data
    WHERE housing = 'yes'
        AND marital = 'single'
    GROUP BY education;
  3. Configure debugging parameters.

    In the right-side navigation pane of the configuration tab of the MaxCompute SQL node, click Debugging Configurations. On the Debugging Configurations tab, make the following configurations:

    1. Select the MaxCompute computing resource that is associated with the workspace when you make preparations from the Computing Resource drop-down list, and select the desired quota from the Computing Quota drop-down list.

    2. Select the serverless resource group that you purchase when you make preparations from the Resource Group drop-down list.

  4. In the top toolbar of the configuration tab, click Save.

Step 4: Debug and run the workflow

After the dw_basic_case workflow is configured, you can debug and run the workflow on the Data Studio page to check whether the configurations of the workflow are correct. To debug and run the workflow, go to the configuration tab of the workflow again.

  1. After you complete the development of the inner nodes of the workflow, switch to the configuration tab of the workflow and click the Save icon in the top toolbar of the configuration tab to save the workflow.

  2. After the workflow is saved, click the Run icon in the top toolbar of the configuration tab to debug and run the workflow.

  3. After the running of the workflow is complete, view the result. The following figure shows the expected result.

    image

Step 5: Query and display data

After the data is uploaded to the MaxCompute computing resource and processed in Data Studio, you can query the data and view the data analysis result.

  1. Click the image icon in the upper-left corner. In the pop-up page, click All Products > DataAnalysis > SQL Query.

  2. In the SQL Query pane, click the image icon next to My Files and select Create File. In the Create File dialog box, configure the File Name parameter based on your business requirements and click OK.

  3. On the SQL query page, enter the following SQL statement.

    SELECT * FROM result_table;
  4. In the upper-right corner of the configuration tab of the SQL query file, select the workspace to which the result_table table belongs and the desired MaxCompute data source, and click OK.

    Note

    In this tutorial, a workspace in Standard Mode is used, and the result_table table is created only in the development environment and is not deployed to the production environment. Therefore, select a MaxCompute data source in the development environment. If you use a workspace in basic mode, you can select a MaxCompute data source in the production environment.

  5. In the top toolbar of the configuration tab, click the Run icon. In the Estimate Costs dialog box, click Run.

  6. On the tab that displays the execution result, click the image icon to view the visualization result. You can click the image icon in the upper-right corner of the tab to customize the graph pattern.

  7. In the upper-right corner of the tab that displays the execution result, click the Save icon to save the chart as a card. In the left-side navigation pane of the SQL Query page, click the image (Cards) icon to view the saved card.

    image

What to do next

Appendix: Resource release and deletion

If you want to release resources or delete nodes created for this tutorial, perform the following operations:

  1. Undeploy auto triggered nodes.

    1. Log on to the DataWorks console. In the target region, click Data Development and O&M > Operation Center in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Operation Center.

    2. In Auto Triggered Task O&M > Auto Triggered Task, select all the scheduled tasks you created previously (the workspace root node does not need to be undeployed), and then click Operation > Undeploy at the bottom.

  2. Delete the workflow and disassociate the MaxCompute computing resource from the workspace.

    1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

    2. In the left navigation pane of the Data Studio page, click the image icon. In the Workspace Directories section of the DATA STUDIO pane, find the created workflow, right-click the workflow name, and then select Delete.

    3. In the left-side navigation pane, click image > Computing Resource Management. Find the associated MaxCompute compute resource and click Disassociate. In the confirmation dialog, select the options and follow the instructions to complete the disassociation.

  3. Delete the MaxCompute project.

    Go to the MaxCompute project management page. Find the MaxCompute project you created and click Delete in the Operation column. Follow the instructions to complete the deletion.

  4. Delete the DataWorks workspace.

    1. Log on to the DataWorks console. In the top navigation bar, select the region in which the workspace you want to delete is located. In the left-side navigation pane of the DataWorks console, click Workspace. On the Workspaces page, find the workspace that you want to delete, click the image icon in the Actions column, and then select Delete Workspace.

    2. In the Delete Workspace dialog box, select the confirmation check boxes and click OK.

FAQ

The target table bank_data is not displayed when uploading data

Go to All Products > DataMap > My Data > My Tools > Refresh Table Metadata. The Guid should follow the format "project.table" or "odps.project.table", where:

  • project is the project name. You can find it by clicking on bank_data in the Data Catalog in Step 1, and checking the Basic Information panel on the right.

  • table is the table name, which is bank_data.