
DataWorks Basics: Analyzing home buyer groups

Last Updated: Jan 27, 2026

This tutorial demonstrates how to analyze home buyer groups to help you master DataWorks Data Development and Data Analysis.

Case introduction

This tutorial analyzes purchasing behavior based on home buyer data. You will use DataWorks to upload local data to a MaxCompute bank_data table, analyze user groups using a MaxCompute SQL node to generate a result_table, and visualize the results to create group profiles.

Note

This tutorial uses simulated data. In actual scenarios, replace this with your own business data.

The following flowchart illustrates the data flow and development process.

image

The analysis yields the following profile: single home buyers with loans primarily hold university.degree or high.school education levels.

image

Prerequisites

Activate DataWorks

This tutorial uses the Singapore region. Log in to the DataWorks Console, switch to the Singapore region, and check if DataWorks is activated in that region.

Note

This tutorial uses Singapore. Select the region where your data resides:

  • If your business data resides in other Alibaba Cloud services, select the same region.

  • If your business is on-premises and requires access via the public network, select a region geographically closer to you to reduce access latency.

New user

New users will see the following prompt. Click Purchase Product Portfolio for Free.

image

  1. Configure the parameters on the combination purchase page.

    • Region: Select the target region. Example: Singapore.

    • DataWorks Edition: Select the DataWorks edition to purchase. Example: Basic Edition.

      Note

      This tutorial uses Basic Edition as an example. All editions support the features covered in this tutorial. See Features of DataWorks editions to select an edition that fits your business needs.

  2. Click Confirm Order and Pay to complete the payment.

Activated but expired

If you have previously activated DataWorks in the Singapore region but the DataWorks edition has expired, the following prompt will appear, and you need to click Purchase Edition.

image

  1. Configure the parameters on the purchase page.

    • Edition: Select the DataWorks edition to purchase. Example: Basic Edition.

      Note

      This tutorial uses Basic Edition as an example. All editions support the features covered in this tutorial. See Features of DataWorks editions to select an edition that fits your business needs.

    • Region and Zone: Select the region where you want to activate DataWorks. Example: Singapore.

  2. Click Buy Now to complete the payment.

Important

After purchasing a DataWorks edition, if you cannot find it in the console, perform the following operations:

  • Wait a few minutes and refresh the page; system updates may be delayed.

  • Check that the current region matches the region where you purchased the edition. Selecting the wrong region is a common reason the edition cannot be found.

Activated

If you have already activated DataWorks in the Singapore region, you will enter the DataWorks overview page and can proceed directly to the next step.

Create a workspace

  1. On the DataWorks Workspace List page, select Singapore and click Create Workspace.

  2. On the Create Workspace page, enter a custom Workspace Name, enable Use Data Studio (New Version), and click Create Workspace.

    Note

    After February 18, 2025, new workspaces created by primary accounts in Singapore enable the new DataStudio by default. The Use Data Studio (New Version) option will not appear.

Create and associate resources

  1. Go to the DataWorks Resource Group List page, switch to the Singapore region, and click Create Resource Group.

  2. On the resource group purchase page, configure the following parameters.

    • Resource Group Name: Custom.

    • VPC, vSwitch: Select an existing VPC and vSwitch. If none exist in the current region, click the console link in the parameter description to create them.

    • Service-linked Role: Follow the on-screen instructions to create a service-linked role.

  3. Click Buy Now to complete the payment.

  4. Go to the DataWorks Resource Groups page, switch to the Singapore region, find the created resource group, and click Associate Workspace in the Actions column.

  5. On the Associate Workspace page, find the created DataWorks workspace and click Associate in its Actions column.

Associate MaxCompute resources

Create a MaxCompute project and associate it with DataWorks for data ingestion and analysis.

  1. Go to the DataWorks Workspace List page, switch to the Singapore region, find the created workspace, and click the workspace name to enter the Workspace Details page.

  2. In the left navigation pane, click Computing Resource to enter the computing resources page. Click Associate Computing Resource and select the MaxCompute type. Configure the following key parameters to create a MaxCompute project and associate it as a DataWorks computing resource.

    Note

    Keep the default values for parameters not mentioned in the table.

    • MaxCompute Project: Click Create in the drop-down selection box and set the following parameters:

      • Project Name: Custom; must be globally unique.

      • Billing Method: Select Pay-as-you-go.

        Note

        If pay-as-you-go is not selectable, click Activate to complete the activation of the MaxCompute service.

      • Default Quota: Select an existing default quota from the drop-down list.

    • Default Access Identity: Select Alibaba Cloud Account.

    • Computing Resource Instance Name: Identifies the resource for task execution. In this tutorial, it is named MaxCompute_Source.

  3. Click OK.

Procedure

In this tutorial, you will use DataWorks to upload test data to a MaxCompute project. Then, you will create a DataStudio workflow to clean and write data, debug the workflow, and verify the results using SQL.

Step 1: Create a table

First, use Data Catalog in DataWorks to create a bank_data table in MaxCompute.

  1. Log on to the DataWorks Console. Switch to the target region, choose Data Development and O&M > Data Development in the left navigation pane, select the corresponding workspace from the drop-down list, and then click Go to Data Studio.

  2. Click the image icon in the left navigation pane to go to the Data Catalog page.

  3. (Optional) If your MaxCompute project is missing from Data Catalog, click the image icon, go to DataWorks Data Sources, and add the project.

  4. Click to expand the MaxCompute directory, select the target MaxCompute project, and create a MaxCompute table in the Table folder.

    Note
    • If the schema feature is enabled for your MaxCompute project, you must select the target schema after selecting the project to create the MaxCompute table in the Table folder.

    • This example uses a standard mode workspace. Create the bank_data table in the development environment only. If you are using a simple mode workspace, you only need to create the bank_data table in the MaxCompute project corresponding to the production environment.

  5. Click the image icon to open the table editing page.

    Enter the following SQL statement in the DDL section. The system will automatically generate the table information.

    CREATE TABLE IF NOT EXISTS bank_data (
        age             BIGINT   COMMENT 'Age',
        job             STRING   COMMENT 'Job type',
        marital         STRING   COMMENT 'Marital status',
        education       STRING   COMMENT 'Education level',
        `default`       STRING   COMMENT 'Has credit card',
        housing         STRING   COMMENT 'Housing loan',
        loan            STRING   COMMENT 'Loan',
        contact         STRING   COMMENT 'Contact method',
        month           STRING   COMMENT 'Month',
        day_of_week     STRING   COMMENT 'Day of week',
        duration        STRING   COMMENT 'Duration',
        campaign        BIGINT   COMMENT 'Number of contacts in this campaign',
        pdays           DOUBLE   COMMENT 'Interval from last contact',
        previous        DOUBLE   COMMENT 'Number of previous contacts',
        poutcome        STRING   COMMENT 'Outcome of previous marketing campaign',
        emp_var_rate    DOUBLE   COMMENT 'Employment variation rate',
        cons_price_idx  DOUBLE   COMMENT 'Consumer price index',
        cons_conf_idx   DOUBLE   COMMENT 'Consumer confidence index',
        euribor3m       DOUBLE   COMMENT 'Euribor 3 month rate',
        nr_employed     DOUBLE   COMMENT 'Number of employees',
        y               BIGINT   COMMENT 'Has term deposit'
    );
  6. On the editing page, click Deploy to create the bank_data table in the MaxCompute project corresponding to the development environment.

  7. After the bank_data table is created, you can click the table name in the Data Catalog to view the table details.
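You can also confirm the table structure with SQL. The following is a minimal check, assuming you run it against the development-environment MaxCompute project where the table was deployed:

  -- Show the schema of the newly created table.
  DESC bank_data;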

Step 2: Upload data

Download the banking.csv file. Use the DataWorks upload feature to upload it to the bank_data table.

Important

Ensure that a Scheduling Resource Group and a Data Integration Resource Group are configured before uploading. For details, see Data Upload Limitations.

  1. Click the image icon and choose All Products > Data Integration > Upload and Download to go to the Upload & Download page.

  2. Click Upload Data and configure the following settings:

    • Data Source: Local file.

    • Specify Data to Be Uploaded: Upload the local banking.csv file.

    • Configure Destination Table:

      • Target Engine: MaxCompute.

      • MaxCompute Project Name: Select the project containing the bank_data table.

      • Select Destination Table: Select the bank_data table as the target table.

    • Preview Data of Uploaded File: Click Mapping by Order to map data to table fields.

    Note
    • Local files support .csv, .xls, .xlsx, and .json formats.

    • For spreadsheet files, the first sheet is uploaded by default.

    • The maximum size for .csv files is 5 GB. For other file types, the limit is 100 MB.

  3. Click Upload Data to upload the data from the downloaded CSV file to the bank_data table in the MaxCompute computing resource.

  4. Verify the upload.

    Verify the data in the bank_data table via SQL Query (Legacy).

    1. Click the image icon in the upper-left corner, and click All Products > Data Analysis > SQL Query in the pop-up page.

    2. Click image > Create File next to My Files, customize the File Name, and click OK.

    3. On the SQL Query page, configure the following SQL.

      SELECT * FROM bank_data LIMIT 10;
    4. Select the workspace and MaxCompute data source where the bank_data table resides in the upper-right corner, and then click OK.

      Note

      This example uses a standard mode workspace and the bank_data table is created only in the development environment. Therefore, you must select the MaxCompute data source for the development environment. If you are using a simple mode workspace, you can select the MaxCompute data source for the production environment.

    5. Click Run (confirm cost estimation if prompted). The bottom pane displays the first 10 records, confirming the upload.

      image
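Beyond spot-checking the first rows, you may want to confirm that the whole file was loaded. A simple sketch, run against the same MaxCompute data source selected above:

  -- Count all rows written to bank_data by the upload.
  SELECT COUNT(*) AS row_cnt FROM bank_data;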

Step 3: Process data

Use a MaxCompute SQL node to filter the bank_data table for the education levels of single home buyers with loans, and then write the results to the result_table.

Build the data processing pipeline

  1. Click the image icon in the upper-left corner and choose All Products > Data Development and O&M > Data Development.

  2. Switch to the workspace created in this tutorial at the top of the page. Click image in the left navigation pane to go to Data Studio.

  3. In Workspace Directories, click image > Create Workflow. Name it dw_basic_case and click OK.

  4. Drag a Zero Load node and two MaxCompute SQL nodes onto the canvas, and rename them as follows:

    • workshop_start (Zero Load): Manages the workflow structure. This is a no-op task that requires no code.

    • ddl_result_table (MaxCompute SQL): Creates result_table to store the cleaned data from bank_data.

    • insert_result_table (MaxCompute SQL): Filters bank_data and writes the results to result_table.

  5. Connect the nodes as shown:

    image

    Note

    Workflows support configuring upstream/downstream dependencies via manual connection or by automatically identifying dependencies through code parsing. This tutorial uses the manual connection method. For more information, see Automatic dependency parsing.

  6. Click Save in the node toolbar.

Configure data processing nodes

Configure ddl_result_table node

This node creates result_table to store the analysis results.

  1. Open the ddl_result_table node.

  2. Paste the following code into the node editing page.

    CREATE TABLE IF NOT EXISTS result_table (
      education STRING COMMENT 'Education level',
      num       BIGINT COMMENT 'Count'
    );
  3. Configure debug parameters.

    Click Running Configurations on the right side of the MaxCompute SQL node editing page:

    • Set Computing Resource to the MaxCompute computing resource associated in Prerequisites.

    • Set Resource Group to the serverless resource group purchased in Prerequisites.

  4. Click Save in the node toolbar.

Configure insert_result_table node

This node queries bank_data and populates result_table.

  1. On the workflow editing page, hover over the insert_result_table node and click Open Node.

  2. Paste the following code into the node editing page.

    INSERT OVERWRITE TABLE result_table -- Insert data into result_table.
    SELECT
      education,
      COUNT(marital) AS num
    FROM bank_data
    WHERE
      housing = 'yes'
      AND marital = 'single'
    GROUP BY
      education;
  3. Configure debug parameters.

    Click Running Configurations on the right side of the MaxCompute SQL node editing page:

    • Set Computing Resource to the MaxCompute computing resource associated in Prerequisites.

    • Set Resource Group to the serverless resource group purchased in Prerequisites.

  4. Click Save in the node toolbar.
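Note that COUNT(marital) counts rows whose marital value is not NULL; because the WHERE clause already fixes marital = 'single', it behaves the same as COUNT(*) here. If you prefer the results pre-sorted by group size, a variant such as the following also works (a sketch; by default MaxCompute requires LIMIT to accompany ORDER BY):

  -- Same aggregation, but write the groups ordered from largest to smallest.
  INSERT OVERWRITE TABLE result_table
  SELECT
    education,
    COUNT(*) AS num
  FROM bank_data
  WHERE
    housing = 'yes'
    AND marital = 'single'
  GROUP BY
    education
  ORDER BY num DESC
  LIMIT 100;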

Step 4: Debug and run

Click the image icon to execute the workflow. Check the logs if any failures occur.

image

Step 5: Data query and display

Data processing is complete. Query the result_table and analyze the data in SQL Query (Legacy).

  1. Click the image icon in the upper-left corner, and click All Products > Data Analysis > SQL Query in the pop-up page.

  2. Click image > Create File next to My Files, customize the File Name, and click OK.

  3. On the SQL Query page, configure the following SQL.

    SELECT * FROM result_table;
  4. Select the workspace and MaxCompute data source where the result_table table resides in the upper-right corner, and then click OK.

    Note

    This example uses a standard mode workspace. result_table exists only in the development environment, so select the corresponding data source. If you are using a simple mode workspace, you can select the MaxCompute data source for the production environment.

  5. Click the Run button at the top. On the cost estimation page, click Run.

  6. Click image in the query results to view the visualized chart results. You can click image in the upper-right corner of the chart to customize the chart style.

  7. You can also click Save in the upper-right corner of the chart to save the chart as a card, and then click Card (image) in the left navigation pane to view it.

    image
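You can also reproduce the headline profile without a chart. The following sketch ranks the education levels by group size (again, MaxCompute expects LIMIT together with ORDER BY); the top rows should match the profile shown earlier, university.degree and high.school:

  -- List the largest education groups first.
  SELECT education, num
  FROM result_table
  ORDER BY num DESC
  LIMIT 3;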

Next steps

Resource release and cleanup

To release resources:

  1. Stop auto-triggered tasks.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

    2. In Auto Triggered Node O&M > Auto Triggered Nodes, select all previously created periodic tasks (the workspace root node does not need to be taken offline), and then click More Actions > Undeploy at the bottom.

  2. Delete nodes and unbind MaxCompute resources.

    1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

    2. In the left navigation pane of DataStudio, click image to enter the data development page. Then, in the Workspace Directories area, find the created workflow, right-click the workflow, and click Delete.

    3. In the left navigation pane, click image > Computing Resources, find the associated MaxCompute computing resource, and click Disassociate. In the confirmation window, check the options and follow the instructions.

  3. Delete the MaxCompute project.

    Go to the MaxCompute Project Management page, find the created MaxCompute project, click Delete in the Actions column, and follow the instructions to complete the deletion.

  4. Delete the DataWorks workspace.

    1. In the DataWorks Console, find the workspace and click Actions image > Delete Workspace.

    2. In the Delete Workspace dialog box, click OK to delete the workspace.