DataWorks: Advanced: Analyze best-selling product categories

Last Updated: Jan 27, 2026

This tutorial covers data ingestion, processing, scheduling, and visualization using core DataWorks features.

Introduction

This tutorial demonstrates how to build a data pipeline—from raw data ingestion to analysis and visualization—using an e-commerce scenario. A standardized process helps you quickly build reusable data flows with reliable scheduling and observability. This lowers the barrier for big data applications, enabling business users to extract value without managing technical details.

You will perform the following tasks:

  1. Data synchronization: Create a batch synchronization task in Data Integration to move business data to a compute platform like MaxCompute.

  2. Data cleaning: Clean, analyze, and mine data in Data Studio.

  3. Data visualization: Visualize analysis results in Data Analysis for easier business interpretation.

  4. Scheduling: Schedule synchronization and cleaning tasks to run automatically.

image

You will synchronize raw product and order data from a public source to MaxCompute, then analyze it to generate a daily ranking of best-selling categories:

image

Prerequisites

Use an Alibaba Cloud account or a RAM user with the AliyunDataWorksFullAccess permission. For more information, see Prepare an Alibaba Cloud account or Prepare a RAM user.

Note

DataWorks supports granular permission control at product and module levels. For details, see Overview of the DataWorks permission management system.

Preparation

Activate DataWorks

This tutorial uses the Singapore region. Log in to the DataWorks Console, switch to the Singapore region, and check if DataWorks is activated in that region.

Note

This tutorial uses Singapore. Select the region where your data resides:

  • If your business data resides in other Alibaba Cloud services, select the same region.

  • If your business is on-premises and requires access via the public network, select a region geographically closer to you to reduce access latency.

New user

New users will see the following prompt. Click Purchase Product Portfolio for Free.

image

  1. Configure the parameters on the combination purchase page.

    • Region: Select the target region. Example: Singapore.

    • DataWorks Edition: Select the DataWorks edition to purchase. Example: Basic Edition.

      Note: This tutorial uses Basic Edition as an example. All editions support the features covered in this tutorial. See Features of DataWorks editions to choose the edition that fits your business needs.

  2. Click Confirm Order and Pay to complete the payment.

Activated but expired

If you have previously activated DataWorks in the Singapore region but the DataWorks edition has expired, the following prompt will appear, and you need to click Purchase Edition.

image

  1. Configure the parameters on the purchase page.

    • Edition: Select the DataWorks edition to purchase. Example: Basic Edition.

      Note: This tutorial uses Basic Edition as an example. All editions support the features covered in this tutorial. See Features of DataWorks editions to choose the edition that fits your business needs.

    • Region and Zone: Select the region where you want to activate DataWorks. Example: Singapore.

  2. Click Buy Now to complete the payment.

Important

If you cannot find the DataWorks edition you purchased, do the following:

  • Wait a few minutes and refresh the page. System updates may be delayed.

  • Check that the current region matches the region where you purchased the DataWorks edition. Selecting the wrong region prevents the edition from appearing.

Activated

If you have already activated DataWorks in the Singapore region, you will enter the DataWorks overview page and can proceed directly to the next step.

Create a workspace

  1. On the DataWorks Workspace List page, select Singapore and click Create Workspace.

  2. On the Create Workspace page, enter a custom Workspace Name, enable Use Data Studio (New Version), and click Create Workspace.

    Note

    After February 18, 2025, new workspaces created by primary accounts in Singapore enable the new DataStudio by default. The Use Data Studio (New Version) option will not appear.

Create and associate resources

  1. Go to the DataWorks Resource Group List page, switch to the Singapore region, and click Create Resource Group.

  2. On the resource group purchase page, configure the following parameters.

    • Resource Group Name: Custom.

    • VPC, vSwitch: Select an existing VPC and vSwitch. If there are none in the current region, click the console link in the parameter description to create them.

    • Service-linked Role: Follow the on-screen instructions to create a service-linked role.

  3. Click Buy Now to complete the payment.

  4. Go to the DataWorks Resource Groups page, switch to the Singapore region, find the created resource group, and click Associate Workspace in the Actions column.

  5. On the Associate Workspace page, find the created DataWorks workspace and click Associate in its Actions column.

Enable public network access

The sample data for this tutorial must be accessed over the public network, but the resource group created earlier cannot access the Internet by default. You must configure an Internet NAT gateway and an elastic IP address (EIP) for the VPC bound to the resource group to enable access.

  1. Log in to the VPC - Internet NAT Gateway Console, switch to the Singapore region in the top navigation bar, and click Create Internet NAT Gateway. Configure the relevant parameters.

    Note

    Keep the default values for parameters not mentioned in the table.

    • Region: Singapore.

    • Network and Zone: Select the VPC and vSwitch bound to the resource group. To view them, go to the DataWorks Resource Groups page, switch to the Singapore region, find the created resource group, click Network Settings in the Actions column, and check the VPC Binding in the Data Scheduling & Data Integration section. For more information about VPCs and vSwitches, see What is VPC?.

    • Network Type: Internet NAT Gateway.

    • Elastic IP Address (EIP): Purchase EIP.

    • Service-linked Role: When you create a NAT gateway for the first time, you must create a service-linked role. Click Create Service-Linked Role.

  2. Click Buy Now to complete the payment and create the NAT gateway instance.

    image

  3. After the NAT gateway instance is purchased successfully, return to the console to create an SNAT entry for the newly purchased NAT gateway instance.

    Note

    The resource group can access the public network only after an SNAT entry is configured.

    1. Click the Manage button in the Actions column of the newly purchased instance to enter the management page of the target NAT gateway instance, and switch to the Configure SNAT tab.

    2. Under SNAT Entry List, click Create SNAT Entry to create an SNAT entry. The key configurations are as follows:

      • SNAT Entry: Select Specify VPC to allow all resource groups in the VPC to access the Internet through the EIP.

      • Select EIP: Select the elastic IP address bound to the current NAT gateway instance.

      After you configure the SNAT entry parameters, click OK to create the SNAT entry.

    When the SNAT entry Status becomes Available, the VPC has internet access.

Associate MaxCompute resources

Create a MaxCompute project and associate it with DataWorks for data ingestion and analysis.

  1. Go to the DataWorks Workspace List page, switch to the Singapore region, find the created workspace, and click the workspace name to enter the Workspace Details page.

  2. In the left navigation pane, click Computing Resource to enter the computing resources page. Click Associate Computing Resource and select the MaxCompute type. Configure the following key parameters to create a MaxCompute project and associate it as a DataWorks computing resource.

    Note

    Keep the default values for parameters not mentioned in the table.

    • MaxCompute Project: Click Create in the drop-down list and set the following parameters:

      • Project Name: Custom; must be unique across the entire network.

      • Billing Method: Select Pay-as-you-go.

        Note: If Pay-as-you-go cannot be selected, click Activate to activate the MaxCompute service first.

      • Default Quota: Select an existing default quota from the drop-down list.

    • Default Access Identity: Select Alibaba Cloud Account.

    • Computing Resource Instance Name: Identifies the computing resource used for task execution. In this tutorial, it is named MaxCompute_Source.

  3. Click OK.

Procedure

This tutorial uses the following scenario as an example to walk you through core DataWorks features:

An e-commerce platform stores product and order data in MySQL. The goal is to analyze order data and visualize daily rankings of best-selling categories.

Step 1: Data synchronization

Create a data source

Create a MySQL data source to connect to the database hosting the sample data.

Note

DataWorks provides sample data in a public MySQL database, so you do not need to prepare raw data yourself. You only need to create a MySQL data source to access the tables.

  1. Go to the DataWorks Management Center page, switch to the Singapore region, select the created workspace from the drop-down box, and click Go to Management Center.

  2. In the left navigation pane, click Data Sources. Click Add Data Source, select the MySQL type, and configure the MySQL data source parameters.

    Note
    • Retain default values for parameters not listed.

    • First-time users must complete cross-service authorization. Follow the prompts to authorize AliyunDIDefaultRole.

    • Data Source Name: In this example, MySQL_Source.

    • Configuration Mode: Select Connection String Mode.

    • Endpoint:

      • Host Address IP: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com

      • Port Number: 3306

      Important: The data provided in this tutorial is solely for practicing data applications in DataWorks. All data is test data and can only be read in the Data Integration module.

    • Database Name: retail_e_commerce

    • Username: workshop

    • Password: workshop#2017

  3. In the Connection Configuration section, switch to the Data Integration tab, find the resource group associated with the workspace, and click Test Network Connectivity in the Connectivity Status column.

    Note

    If the MySQL data source connectivity test fails, perform the following operations:

    • Follow the prompts of the connectivity diagnostic tool and complete the suggested operations.

    • Check if an EIP is configured for the VPC bound to the resource group, as the MySQL data source requires the resource group to have public network access capability. For details, see Enable public network access.

  4. Click Complete Creation.

Build a synchronization pipeline

In this step, you build a synchronization pipeline that moves product and order data from the e-commerce platform into MaxCompute tables, preparing for subsequent data processing.

  1. Click the image icon in the upper-left corner and select All Products > Data Development and O&M > Data Studio to enter the data development page.

  2. Switch to the workspace created in this tutorial at the top of the page, and click image in the left navigation pane to enter the Workspace Directories page.

  3. In the Workspace Directories area, click image, select Create Workflow, and set the workflow name. In this tutorial, it is set to dw_quickstart.

  4. On the workflow orchestration page, drag Zero Load and Batch Synchronization nodes from the left side to the canvas, and set the node names respectively.

    The node names and functions are described below:

    • Zero Load node, named workshop: Manages the entire workflow and makes the data flow path clearer. This node is a dry-run task and does not require code editing.

    • Batch Synchronization node, named ods_item_info: Synchronizes the product information source table item_info in MySQL to the ods_item_info table in MaxCompute.

    • Batch Synchronization node, named ods_trade_order: Synchronizes the order information source table trade_order in MySQL to the ods_trade_order table in MaxCompute.
    Manually drag lines to set the workshop node as the upstream node for the two batch synchronization nodes. The final effect is as follows:

    image
  5. Workflow scheduling configuration.

    Click Scheduling on the right side of the workflow orchestration page to configure relevant parameters. The following are the key parameters required for this tutorial. Retain default values for parameters not listed.

    • Scheduling Parameters: Scheduling parameters set for the entire workflow can be used directly by the nodes inside it. In this tutorial, configure bizdate=$[yyyymmdd-1] to obtain the date of the previous day.

      Note: DataWorks scheduling parameters enable dynamic values in code. Define variables in SQL code in the ${Variable Name} format and assign values to them under Scheduling > Scheduling Parameters. For the supported formats, see Supported formats for scheduling parameters. A substitution sketch follows this list.

    • Scheduling Cycle: In this tutorial, configure it as Day.

    • Scheduling Time: In this tutorial, set Scheduling Time to 00:30. The workflow starts at 00:30 every day.

    • Scheduling Dependencies: The workflow has no upstream dependency, so this can be left unconfigured. For easier unified management, you can click Use Workspace Root Node to mount the workflow under the workspace root node. The workspace root node is named in the format WorkspaceName_root.
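
    The following sketch is not part of the tutorial's configuration; it only illustrates how scheduling-parameter substitution works. Assume a hypothetical run date of 2025-04-17: with bizdate=$[yyyymmdd-1], DataWorks replaces ${bizdate} in node code before the SQL is submitted.

    -- Hypothetical illustration of scheduling-parameter substitution.
    -- If the scheduled instance runs on 2025-04-17, ${bizdate} resolves to 20250416,
    -- so the statement below
    SELECT '${bizdate}' AS bizdate;
    -- is executed by MaxCompute as:
    -- SELECT '20250416' AS bizdate;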

Configure the synchronization task

Configure initial node
  1. On the workflow orchestration page, hover over the workshop node and click Open Node.

  2. Click Scheduling on the right side of the workshop node editing page to configure relevant parameters. The following are the key parameters required for this tutorial. Retain default values for parameters not listed.

    • Scheduling Type: In this tutorial, configure it as Dry-run.

    • Resource Group: In this tutorial, configure it as the serverless resource group created in Create and associate resources.

    • Scheduling Dependencies: Because workshop is the initial node and has no upstream dependency, you can click Use Workspace Root Node so that the workspace root node triggers workflow execution. The workspace root node is named in the format WorkspaceName_root.

  3. Click Save in the node toolbar to save the node.

Configure product info pipeline
  1. On the workflow orchestration page, hover over the ods_item_info node and click Open Node.

  2. Configure synchronization pipeline network and resources.

    • Source:

      • Data source: MySQL.

      • Data source name: MySQL_Source.

    • Resource Group: Select the serverless resource group purchased in Create and associate resources.

    • Destination:

      • Data destination: MaxCompute(ODPS).

      • Data source name: Select the MaxCompute computing resource associated in Associate MaxCompute resources. In this example, it is MaxCompute_Source.

  3. Click Next to configure the synchronization task.

    1. Configure data source and destination

      Note

      Retain default values for parameters not listed.

      • Source > Table: item_info.

      • Destination > Table: Click Generate Destination Table Schema to quickly create a MaxCompute table. Paste the following table creation statement into the Table Creation Statement area and click Create Table. This table receives the product information from the data source.

      Table Creation SQL

      CREATE TABLE IF NOT EXISTS ods_item_info(
        `id`                              BIGINT COMMENT '',
        `cate_id`                         BIGINT COMMENT '',
        `cate_name`                       STRING COMMENT '',
        `commodity_id`                    BIGINT COMMENT '',
        `commodity_name`                  STRING COMMENT '',
        `desc_path`                       STRING COMMENT '',
        `duration`                        BIGINT COMMENT '',
        `features`                        STRING COMMENT '',
        `gmt_create`                      DATETIME COMMENT '',
        `gmt_modified`                    DATETIME COMMENT '',
        `is_deleted`                      BIGINT COMMENT '',
        `is_virtual`                      STRING COMMENT '',
        `item_id`                         BIGINT COMMENT '',
        `item_status`                     BIGINT COMMENT '',
        `last_offline_time`               DATETIME COMMENT '',
        `last_online_quantity`            BIGINT COMMENT '',
        `last_online_time`                DATETIME COMMENT '',
        `pict_url`                        STRING COMMENT '',
        `reserve_price`                   DECIMAL(38,18) COMMENT '',
        `secure_trade_ems_post_fee`       DECIMAL(38,18) COMMENT '',
        `secure_trade_fast_post_fee`      DECIMAL(38,18) COMMENT '',
        `secure_trade_ordinary_post_fee`  DECIMAL(38,18) COMMENT '',
        `shop_id`                         BIGINT COMMENT '',
        `shop_nick`                       STRING COMMENT '',
        `sub_title`                       STRING COMMENT '',
        `title`                           STRING COMMENT ''
      )
      COMMENT ''
      PARTITIONED BY (pt STRING) 
      lifecycle 36500;

      • Partition Information: In this tutorial, enter ${bizdate}. A constant value is assigned to the bizdate parameter during the testing phase, and a value is assigned dynamically during scheduled execution. For more information about the variable formats and configuration methods that DataStudio supports, see Scheduling parameters. An optional verification query appears after this procedure.

    2. Confirm Field Mapping and Channel Control.

      DataWorks allows you to configure field mapping relationships between the source and destination to write data from specified source fields to specified destination fields. It also supports settings such as task concurrency and dirty data policies. In this tutorial, configure the Policy for Dirty Data Records to Disallow Dirty Data Records, and keep other settings as default. For more information, see Configure a task in the codeless UI.

  4. Click Save in the node toolbar to save the node.
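
After you debug the workflow in Step 3, you can optionally confirm that this synchronization task wrote rows into the expected MaxCompute partition. The following query is only a sketch; 20250416 is a hypothetical business date that you would replace with the constant used in your debug run.

    -- Optional check: count the rows synchronized into the debug-date partition.
    SELECT COUNT(*) AS row_cnt
    FROM ods_item_info
    WHERE pt = '20250416';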

Configure order data pipeline
  1. On the workflow orchestration page, hover over the ods_trade_order node and click Open Node.

  2. Configure synchronization pipeline network and resources.

    • Source:

      • Data source: MySQL.

      • Data source name: MySQL_Source.

    • Resource Group: Select the serverless resource group purchased in Create and associate resources.

    • Destination:

      • Data destination: MaxCompute(ODPS).

      • Data source name: Select the MaxCompute computing resource associated in Associate MaxCompute resources. In this example, it is MaxCompute_Source.

  3. Click Next to configure the synchronization task.

    1. Configure data source and destination

      Note

      Retain default values for parameters not listed.

      • Source > Table: trade_order.

      • Destination > Table: Click Generate Destination Table Schema to quickly create a MaxCompute table. Paste the following table creation statement into the Table Creation Statement area and click Create Table. This table receives the order information from the data source.

      Table Creation SQL

      CREATE TABLE IF NOT EXISTS ods_trade_order(
        `id`                BIGINT COMMENT '',
        `biz_type`          BIGINT COMMENT '',
        `buy_amount`        BIGINT COMMENT '',
        `buyer_id`          BIGINT COMMENT '',
        `buyer_memo`        STRING COMMENT '',
        `buyer_nick`        STRING COMMENT '',
        `end_time`          DATETIME COMMENT '',
        `gmt_create`        DATETIME COMMENT '',
        `gmt_modified`      DATETIME COMMENT '',
        `ip`                STRING COMMENT '',
        `is_parent`         BIGINT COMMENT '',
        `is_sub`            BIGINT COMMENT '',
        `item_id`           BIGINT COMMENT '',
        `item_price`        DECIMAL(38,18) COMMENT '',
        `logistics_status`  BIGINT COMMENT '',
        `memo`              STRING COMMENT '',
        `parent_order_id`   BIGINT COMMENT '',
        `pay_status`        BIGINT COMMENT '',
        `pay_time`          DATETIME COMMENT '',
        `seller_memo`       STRING COMMENT '',
        `shop_id`           BIGINT COMMENT '',
        `status`            BIGINT COMMENT '',
        `sub_order_id`      BIGINT COMMENT '',
        `total_fee`         DECIMAL(38,18) COMMENT ''
      )
      COMMENT ''
      PARTITIONED BY (pt STRING) 
      lifecycle 36500;

      • Partition Information: In this tutorial, enter ${bizdate}. A constant value is assigned to the bizdate parameter during the testing phase, and a value is assigned dynamically during scheduled execution. For more information about the variable formats and configuration methods that DataStudio supports, see Scheduling parameters.

    2. Confirm Field Mapping and Channel Control.

      DataWorks allows you to configure field mapping relationships between the source and destination to write data from specified source fields to specified destination fields. It also supports settings such as task concurrency and dirty data policies. In this tutorial, configure the Policy for Dirty Data Records to Disallow Dirty Data Records, and keep other settings as default. For more information, see Configure a task in the codeless UI.

  4. Click Save in the node toolbar to save the node.
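
Similarly, after the debug run in Step 3 you can preview a few synchronized order rows. This is an optional sketch; 20250416 is a hypothetical business date.

    -- Optional check: preview a few order rows written to the debug-date partition.
    SELECT id, item_id, buy_amount, total_fee
    FROM ods_trade_order
    WHERE pt = '20250416'
    LIMIT 10;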

Step 2: Data cleaning

The synchronization from MySQL to MaxCompute produces two tables: the product information table ods_item_info and the order information table ods_trade_order. You can now clean, process, and analyze this data in the DataStudio module of DataWorks to obtain the daily ranking of best-selling product categories.

Build a data processing pipeline

  1. In the left navigation pane of DataStudio, click image to enter the data development page. In the Workspace Directories area, find the created workflow and click it to enter the workflow orchestration page. Drag MaxCompute SQL nodes from the left side to the canvas and set the node names.

    The node names and functions are described below:

    • MaxCompute SQL node dim_item_info: Processes product dimension data based on the ods_item_info table to produce the product basic information dimension table dim_item_info.

    • MaxCompute SQL node dwd_trade_order: Performs initial cleaning, transformation, and business logic processing on detailed order transaction data based on the ods_trade_order table to produce the transaction order detail fact table dwd_trade_order.

    • MaxCompute SQL node dws_daily_category_sales: Aggregates the cleaned and standardized detail data from the DWD layer based on the dwd_trade_order and dim_item_info tables to produce the daily product category sales summary table dws_daily_category_sales.

    • MaxCompute SQL node ads_top_selling_categories: Produces the daily best-selling product category ranking table ads_top_selling_categories based on the dws_daily_category_sales table.
  2. Manually drag lines to configure the upstream nodes for each node. The final effect is as follows:

    image
    Note

    The workflow supports setting upstream and downstream dependencies for each node via manual connection. It also supports using code parsing within child nodes to automatically identify node dependencies. This tutorial uses the manual connection method. For more information about code parsing, see Automatic dependency parsing.

Configure data processing nodes

Configure dim_item_info

Processes product dimension data based on the ods_item_info table to produce the product basic information dimension table dim_item_info.

  1. On the workflow orchestration page, hover over the dim_item_info node and click Open Node.

  2. Paste the following code into the node editing page.

    CREATE TABLE IF NOT EXISTS dim_item_info (
        gmt_modified                   STRING COMMENT 'Product last modified date',
        gmt_create                     STRING COMMENT 'Product creation time',
        item_id                        BIGINT COMMENT 'Product numeric ID',
        title                          STRING COMMENT 'Product title',
        sub_title                      STRING COMMENT 'Product subtitle',
        pict_url                       STRING COMMENT 'Main image URL',
        desc_path                      STRING COMMENT 'Path of product description',
        item_status                    BIGINT COMMENT 'Product status 1: Confirmed 0: Not confirmed',
        last_online_time               DATETIME COMMENT 'Last time sales started, product listing time',
        last_offline_time              DATETIME COMMENT 'Sales end time, indicates the end of a sales cycle, only for auction items',
        duration                       BIGINT COMMENT 'Validity period, sales cycle, only two values, 7 days or 14 days',
        reserve_price                  DOUBLE COMMENT 'Current price',
        secure_trade_ordinary_post_fee DOUBLE COMMENT 'Standard mail fee',
        secure_trade_fast_post_fee     DOUBLE COMMENT 'Express mail fee',
        secure_trade_ems_post_fee      DOUBLE COMMENT 'EMS mail fee',
        last_online_quantity           BIGINT COMMENT 'Inventory quantity when the product was last listed',
        features                       STRING COMMENT 'Product features',
        cate_id                        BIGINT COMMENT 'Product leaf category ID',
        cate_name                      STRING COMMENT 'Product leaf category name',
        commodity_id                   BIGINT COMMENT 'Category ID',
        commodity_name                 STRING COMMENT 'Category name',
        is_virtual                     STRING COMMENT 'Whether it is a virtual product',
        shop_id                        BIGINT COMMENT 'Shop ID',
        shop_nick                      STRING COMMENT 'Shop NICK',
        is_deleted                     BIGINT COMMENT 'Whether the category is deleted'
    )
    COMMENT 'Product basic information dimension table'
    PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd')
    LIFECYCLE 365;
    
    
    -- Insert data into the dim_item_info table
    INSERT OVERWRITE TABLE dim_item_info PARTITION(pt='${bizdate}')
    SELECT
        gmt_modified,
        gmt_create,
        item_id,
        title,
        sub_title,
        pict_url,
        desc_path,
        item_status,
        last_online_time,
        last_offline_time,
        duration,
        cast(reserve_price as DOUBLE),
        cast(secure_trade_ordinary_post_fee as DOUBLE),
        cast(secure_trade_fast_post_fee as DOUBLE),
        cast(secure_trade_ems_post_fee as DOUBLE),
        last_online_quantity,
        features,
        cate_id,
        cate_name,
        commodity_id,
        commodity_name,
        is_virtual,
        shop_id,
        shop_nick,
        is_deleted
    FROM ods_item_info
    WHERE pt = '${bizdate}';
  3. Configure debugging parameters.

    Click Running Configurations on the right side of the MaxCompute SQL node editing page, and select the MaxCompute computing resource (MaxCompute_Source in this tutorial) and the serverless resource group created earlier.

  4. Click Save in the node toolbar to save the node.

Configure dwd_trade_order

Performs initial cleaning, transformation, and business logic processing on detailed order transaction data based on the ods_trade_order table to produce the transaction order detail fact table dwd_trade_order.

  1. On the workflow orchestration page, hover over the dwd_trade_order node and click Open Node.

  2. Paste the following code into the node editing page.

    CREATE TABLE IF NOT EXISTS dwd_trade_order (
        id               BIGINT COMMENT 'Primary key, latest ID after deduplication',
        gmt_create       DATETIME COMMENT 'Creation time',
        gmt_modified     DATETIME COMMENT 'Modification time',
        sub_order_id     BIGINT COMMENT 'Sub-order ID',
        parent_order_id  BIGINT COMMENT 'Parent order ID',
        buyer_id         BIGINT COMMENT 'Buyer numeric ID',
        buyer_nick       STRING COMMENT 'Buyer nickname, handles null values',
        item_id          BIGINT COMMENT 'Product numeric ID',
        item_price       DECIMAL(38,18) COMMENT 'Product price, in cents',
        buy_amount       BIGINT COMMENT 'Purchase quantity',
        biz_type         BIGINT COMMENT 'Transaction type',
        memo             STRING COMMENT 'Memo, handles null values',
        pay_status       BIGINT COMMENT 'Payment status',
        logistics_status BIGINT COMMENT 'Logistics status',
        status           BIGINT COMMENT 'Status',
        seller_memo      STRING COMMENT 'Seller memo for the transaction',
        buyer_memo       STRING COMMENT 'Buyer memo for the transaction',
        clean_ip         STRING COMMENT 'Cleansed buyer IP, filters invalid formats',
        end_time         DATETIME COMMENT 'Transaction end time',
        pay_time         DATETIME COMMENT 'Payment time',
        is_sub           BIGINT COMMENT 'Whether it is a sub-order, 1 indicates sub-order',
        is_parent        BIGINT COMMENT 'Whether it is a parent order, 1 indicates parent order',
        shop_id          BIGINT COMMENT 'Shop ID',
        total_fee        DECIMAL(38,18) COMMENT 'Sub-order fee after discounts and adjustments',
        is_large_order_flag BOOLEAN COMMENT 'Flag for large orders'
    )
    COMMENT 'Transaction order detail fact table, includes initial cleansing and business logic processing'
    PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd')
    LIFECYCLE 365; -- Data time to live set to 365 days
    
    
    INSERT OVERWRITE TABLE dwd_trade_order PARTITION(pt='${bizdate}')
    SELECT
        MAX(id) AS id, -- Assume using the latest ID as the deduplication standard
        gmt_create,
        gmt_modified,
        sub_order_id,
        parent_order_id,
        buyer_id,
        COALESCE(buyer_nick, '') AS buyer_nick, -- Handle null buyer_nick
        item_id,
        item_price,
        buy_amount,
        biz_type,
        COALESCE(memo, '') AS memo, -- Handle null memo
        pay_status,
        logistics_status,
        status,
        seller_memo,
        buyer_memo,
        CASE 
            WHEN ip LIKE '__.__.__.__' THEN NULL -- Filter invalid IP formats
            ELSE ip 
        END AS clean_ip,
        end_time,
        pay_time,
        is_sub,
        is_parent,
        shop_id,
        total_fee,
        CASE 
            WHEN total_fee >= 10000 THEN TRUE -- Assume orders over 10000 cents are large orders
            ELSE FALSE 
        END AS is_large_order_flag -- Add business logic flag
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER(PARTITION BY buyer_id, item_id, gmt_create ORDER BY id DESC) AS rn -- Row number for deduplication
        FROM ods_trade_order
        WHERE pt = '${bizdate}'
    ) AS sub_query
    WHERE rn = 1 -- Keep only the first record of each deduplication group
    GROUP BY 
        gmt_create,
        gmt_modified,
        sub_order_id,
        parent_order_id,
        buyer_id,
        buyer_nick,
        item_id,
        item_price,
        buy_amount,
        biz_type,
        memo,
        pay_status,
        logistics_status,
        status,
        seller_memo,
        buyer_memo,
        clean_ip,
        end_time,
        pay_time,
        is_sub,
        is_parent,
        shop_id,
        total_fee,
        is_large_order_flag;
  3. Configure debugging parameters.

    Click Running Configurations on the right side of the MaxCompute SQL node editing page, and select the MaxCompute computing resource (MaxCompute_Source in this tutorial) and the serverless resource group created earlier.

  4. Click Save in the node toolbar to save the node.
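
If you want to verify the deduplication logic after the debug run in Step 3, the following optional query is one way to do so; 20250416 is a hypothetical business date. It returns no rows when each (buyer_id, item_id, gmt_create) combination appears only once in the partition.

    -- Optional data-quality check: look for duplicate order keys after deduplication.
    SELECT buyer_id, item_id, gmt_create, COUNT(*) AS cnt
    FROM dwd_trade_order
    WHERE pt = '20250416'
    GROUP BY buyer_id, item_id, gmt_create
    HAVING COUNT(*) > 1;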

Configure dws_daily_category_sales

Aggregates the cleaned and standardized detail data from the DWD layer based on the dwd_trade_order table and dim_item_info table to produce the daily product category sales summary table dws_daily_category_sales.

  1. On the workflow orchestration page, hover over the dws_daily_category_sales node and click Open Node.

  2. Paste the following code into the node editing page.

    CREATE TABLE IF NOT EXISTS dws_daily_category_sales (
        cate_id             BIGINT COMMENT 'Product leaf category ID',
        cate_name           STRING COMMENT 'Product leaf category name',
        total_sales_amount  DECIMAL(38,18) COMMENT 'Total sales amount for the product category, in cents',
        order_count         BIGINT COMMENT 'Number of orders'
    )
    COMMENT 'Daily product category sales aggregate table'
    PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd')
    LIFECYCLE 365;
    
    
    INSERT OVERWRITE TABLE dws_daily_category_sales PARTITION(pt='${bizdate}')
    SELECT
        i.cate_id,
        i.cate_name,
        SUM(t.total_fee) AS total_sales_amount,
        COUNT(DISTINCT t.id) AS order_count
    FROM dwd_trade_order t
    JOIN dim_item_info i ON t.item_id = i.item_id AND t.pt = i.pt
    WHERE t.pt = '${bizdate}'
    GROUP BY t.pt, i.cate_id, i.cate_name;
  3. Configure debugging parameters.

    Click Running Configurations on the right side of the MaxCompute SQL node editing page, and select the MaxCompute computing resource (MaxCompute_Source in this tutorial) and the serverless resource group created earlier.

  4. Click Save in the node toolbar to save the node.
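
As an optional cross-check after the debug run in Step 3, you can verify that the summary layer reconciles with the detail layer it was built from. This query is a sketch, not part of the tutorial's workflow; 20250416 is a hypothetical business date. The two totals should match because the summary is produced from the same join.

    -- Optional reconciliation: total sales in the DWS summary vs. the joined DWD/DIM detail.
    SELECT d.dws_total, f.dwd_total
    FROM (
        SELECT 1 AS k, SUM(total_sales_amount) AS dws_total
        FROM dws_daily_category_sales
        WHERE pt = '20250416'
    ) d
    JOIN (
        SELECT 1 AS k, SUM(t.total_fee) AS dwd_total
        FROM dwd_trade_order t
        JOIN dim_item_info i ON t.item_id = i.item_id AND t.pt = i.pt
        WHERE t.pt = '20250416'
    ) f ON d.k = f.k;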

Configure ads_top_selling_categories

Produces the daily best-selling product category ranking table ads_top_selling_categories based on the dws_daily_category_sales table.

  1. On the workflow orchestration page, hover over the ads_top_selling_categories node and click Open Node.

  2. Paste the following code into the node editing page.

    CREATE TABLE IF NOT EXISTS ads_top_selling_categories (
        rank                BIGINT COMMENT 'Sales rank',
        cate_id             BIGINT COMMENT 'Product leaf category ID',
        cate_name           STRING COMMENT 'Product leaf category name',
        total_sales_amount  DECIMAL(38,18) COMMENT 'Total sales amount for the product category, in cents',
        order_count         BIGINT COMMENT 'Number of orders'
    )
    COMMENT 'Daily ranking table of best-selling product categories'
    PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd');
    
    
    INSERT OVERWRITE TABLE ads_top_selling_categories PARTITION(pt='${bizdate}')
    SELECT
        rank,
        cate_id,
        cate_name,
        total_sales_amount,
        order_count
    FROM (
        SELECT
            DENSE_RANK() OVER(ORDER BY total_sales_amount DESC) AS rank,
            cate_id,
            cate_name,
            total_sales_amount,
            order_count
        FROM (
            SELECT
                cate_id,
                cate_name,
                SUM(total_sales_amount) AS total_sales_amount,
                SUM(order_count) AS order_count
            FROM dws_daily_category_sales
            WHERE pt = '${bizdate}'
            GROUP BY cate_id, cate_name
        ) agg_sub
    ) agg_outer
    WHERE rank <= 10;
  3. Configure debugging parameters.

    Click Running Configurations on the right side of the MaxCompute SQL node editing page, and select the MaxCompute computing resource (MaxCompute_Source in this tutorial) and the serverless resource group created earlier.

  4. Click Save in the node toolbar to save the node.

Step 3: Debug and run

After the workflow configuration is complete, you need to run the entire workflow to verify the correctness of the configuration before deploying it to the production environment.

  1. In the left navigation pane of DataStudio, click image to enter the data development page. Then, in the Workspace Directories area, find the created workflow.

  2. Click Run in the node toolbar, and fill in Value Used in This Run with the date of the previous day (e.g., 20250416).

    Note

    The workflow nodes use DataWorks scheduling parameters for dynamic values in code. During debugging, you must assign a constant value to the bizdate parameter for testing.

  3. Click OK to enter the debug running page.

  4. Wait for the run to complete. The expected result is as follows:

    image

Step 4: Data query and visualization

You have processed the raw test data obtained from MySQL through data development and aggregated it into the table ads_top_selling_categories. Now you can query the table data to view the data analysis results.

  1. Click the image icon in the upper-left corner, and click All Products > Data Analysis > SQL Query in the pop-up page.

  2. Click image > Create File next to My Files, customize the File Name, and click OK.

  3. On the SQL Query page, configure the following SQL.

    SELECT * FROM ads_top_selling_categories WHERE pt=${bizdate};
  4. Select the MaxCompute data source in the upper-right corner and click OK.

  5. Click the Run button at the top, and click Run on the Cost Estimation page.

  6. Click image in the query results to view the visualized chart results. You can click image in the upper-right corner of the chart to customize the chart style.

  7. You can also click Save in the upper-right corner of the chart to save the chart as a card, and then click Card (image) in the left navigation pane to view it.
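
If you want to explore further in SQL Query, the following optional example keeps only the top three categories for one day; 20250416 is a hypothetical business date, and the table and columns are the ones created in Step 2.

    -- Optional follow-up query: top 3 best-selling categories for a single day.
    SELECT rank, cate_name, total_sales_amount, order_count
    FROM ads_top_selling_categories
    WHERE pt = '20250416'
      AND rank <= 3
    ORDER BY rank;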

Step 5: Periodic scheduling

The previous steps produced the previous day's sales data for each product category. To obtain the latest sales data every day, deploy the workflow to the production environment so that it runs periodically on schedule.

Note

Scheduling-related parameters were configured for the workflow, synchronization nodes, and data processing nodes when configuring data synchronization and data processing. You do not need to configure them again here; simply deploy the workflow to the production environment. For more detailed information about scheduling configuration, see Node scheduling configuration.

  1. Click the image icon in the upper-left corner, and click All Products > Data Development and O&M > Data Studio in the pop-up page.

  2. In the left navigation pane of DataStudio, click image to enter the data development page, switch to the project space used in this case, and then find the created workflow in the Workspace Directories area.

  3. Click Deploy in the node toolbar. In the deployment panel, click Start Deployment to Production. Wait for Build Package and Prod Online Check to complete, and then click Deploy.

  4. After the Prod Online status becomes Complete, click Perform O&M to go to the Operation Center.

    image

  5. In Auto Triggered Node O&M > Auto Triggered Nodes, you can see the periodic tasks of the workflow (in this tutorial, the workflow is named dw_quickstart).

  6. To view the periodic task details of child nodes within the workflow, right-click the periodic task of the workflow and select View Internal Tasks.

    image

    The expected result is as follows:

    image

Next steps

Resource cleanup

To release the resources created in this tutorial, perform the following steps:

  1. Stop periodic tasks.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

    2. In Auto Triggered Node O&M > Auto Triggered Nodes, select all previously created periodic tasks (the workspace root node does not need to be taken offline), and then click More Actions > Undeploy at the bottom.

  2. Delete data development nodes and disassociate MaxCompute computing resources.

    1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

    2. In the left navigation pane of DataStudio, click image to enter the data development page. Then, in the Workspace Directories area, find the created workflow, right-click the workflow, and click Delete.

    3. In the left navigation pane, click image > Computing Resources, find the associated MaxCompute computing resource, and click Disassociate. In the confirmation window, check the options and follow the instructions.

  3. Delete the MySQL data source.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left navigation pane, click Data Sources. Find the created MySQL data source, click Delete in the Actions column, and follow the instructions to complete the deletion.

  4. Delete the MaxCompute project.

    Go to the MaxCompute Project Management page, find the created MaxCompute project, click Delete in the Actions column, and follow the instructions to complete the deletion.

  5. Delete the Internet NAT Gateway and release the Elastic IP Address.

    1. Go to the VPC - Internet NAT Gateway Console, and switch to the Singapore region in the top menu bar.

    2. Find the created Internet NAT Gateway, click image > Delete in the Actions column, check Force Delete in the confirmation window, and then click OK.

    3. In the left navigation pane, click Access to Internet > Elastic IP Addresses. Find the created Elastic IP Address, click image > Instance Management > Release in the Actions column, and click OK in the confirmation window.