
DataWorks: Advanced: Analyze best-selling product categories for orders

Last Updated: Nov 19, 2025

DataWorks is a unified, end-to-end platform for big data development and governance. It integrates with big data engines such as MaxCompute, Hologres, EMR, AnalyticDB, and CDP and supports solutions such as data warehouses, data lakes, and data lakehouses. This topic provides an example of how to use the core features of DataWorks to integrate data, process business logic, schedule recurring tasks, and visualize data.

Introduction

This tutorial uses an E-commerce scenario as an example to show you how to build a complete data pipeline, from raw data integration to data analysis, calculation, and visualization. You will learn how to quickly build a reusable data production pipeline by following a standardized development process. This process ensures scheduling reliability and operational observability. This approach allows business personnel to transform data into value without requiring deep technical knowledge, which lowers the barrier for enterprises to adopt big data applications.

This tutorial helps you quickly learn how to perform the following operations:

  1. Data synchronization: Use the Data Integration module in DataWorks to create an offline sync task. This task synchronizes business data to a big data computing platform, such as MaxCompute.

  2. Data cleansing: Use the DataStudio module in DataWorks to process, analyze, and mine business data.

  3. Data visualization: Use the Data Analysis module in DataWorks to convert analysis results into charts that are easy for business personnel to understand.

  4. Recurring scheduling: Configure recurring schedules for the data synchronization and data cleansing workflows to run them at scheduled times.


This tutorial synchronizes raw product and order data from a public data source to MaxCompute. It then runs a data analysis workflow to generate a daily ranking of the best-selling product categories.

Prerequisites

To complete the steps in this tutorial, you must use an Alibaba Cloud account or a RAM user that has the AliyunDataWorksFullAccess permission. For more information, see Prepare an Alibaba Cloud account or Prepare a RAM user.

Note

DataWorks provides a comprehensive access control mechanism that supports permission management at the product and module levels. For finer-grained access control, see Overview of the DataWorks permission model.

Preparations

Activate DataWorks

This tutorial uses the China (Shanghai) region as an example. Log on to the DataWorks console, switch to the China (Shanghai) region, and check whether DataWorks is activated in this region.

Note

In practice, select the region where your business data is located:

  • If your business data is on other Alibaba Cloud services, select the same region where those services are deployed.

  • If your business is on-premises and requires internet access, select a region that is geographically close to your location to reduce access latency.

New users

If you are using DataWorks for the first time in the selected region, a page indicating that DataWorks is not yet activated is displayed. Click 0-Yuan Combination Purchase.

  1. Configure the parameters on the combination purchase page.

    • Region: Select the region where you want to activate DataWorks. This tutorial uses China (Shanghai).

    • DataWorks Edition: Select the DataWorks edition that you want to purchase. This tutorial uses Basic Edition.

      Note

      All editions support the features covered in this tutorial. See Features of DataWorks editions to select an appropriate edition as needed.

  2. Click Confirm Order And Pay and complete the payment.

Activated but expired

If you have previously activated DataWorks in the China (Shanghai) region but your subscription has expired, an expiration message is displayed. Click Purchase Edition.

  1. Configure the parameters on the purchase page.

    • Edition: Select the DataWorks edition that you want to purchase. This tutorial uses Basic Edition.

      Note

      All editions support the features covered in this tutorial. See Features of DataWorks editions to select an appropriate edition as needed.

    • Region: Select the region where you want to activate DataWorks. This tutorial uses China (Shanghai).

  2. Click Buy Now and complete the payment.

Important

If you cannot find your purchased DataWorks edition, try the following:

  • Wait a few minutes and refresh the page to account for potential system update delays.

  • Ensure that your current region is the same as the region where you purchased the DataWorks edition. A region mismatch can prevent the edition from being displayed.

Already activated

If you have already activated DataWorks in the China (Shanghai) region, you will be taken to the DataWorks overview page. You can proceed to the next step.

Create a workspace

  1. Go to the DataWorks workspace list page, switch to the China (Shanghai) region, and click Create Workspace.

  2. On the Create Workspace page, enter a custom Workspace Name, enable Join The Public Preview Of DataStudio, and then click Create Workspace.

    Note

    After February 18, 2025, when an Alibaba Cloud account activates DataWorks and creates a workspace in the China (Shanghai) region for the first time, the new version of DataStudio is enabled by default. The Join The Public Preview Of DataStudio option will not be displayed.

Create a resource group and attach it to a workspace

  1. Go to the DataWorks resource group list page, switch to the China (Shanghai) region, and click Create Resource Group.

  2. On the resource group purchase page, configure the following parameters.

    • Resource Group Name: Custom.

    • Virtual Private Cloud (VPC) and vSwitch: Select an existing VPC and vSwitch. If none are available in the current region, click the console links in the parameter description to create them.

    • Service-linked Role: Follow the on-screen instructions to create the AliyunServiceRoleForDataWorks service-linked role.

  3. Click Buy Now and complete the payment.

  4. Go to the DataWorks resource group list page, switch to the China (Shanghai) region, find the resource group you created, and click Attach Workspace in the Actions column.

  5. On the Attach Workspace page, find the DataWorks workspace you created and click Attach in its Actions column.

Enable Internet access for the resource group

The public test data from the E-commerce platform used in this tutorial must be accessed over the internet. The general-purpose resource group created in the previous step does not have internet access by default. To retrieve the data, you must enable internet access by configuring an Internet NAT gateway and adding an Elastic IP Address (EIP) for the VPC that is attached to the resource group.

  1. Log on to the VPC - Internet NAT Gateway console. In the top menu bar, switch to the China (Shanghai) region and click Create Internet NAT Gateway. Configure the relevant parameters.

    Note

    Keep the default values for parameters not mentioned in the table.

    • Region: China (Shanghai).

    • Network And Zone: Select the VPC and vSwitch attached to the resource group.

      You can go to the DataWorks resource group list page, switch to the China (Shanghai) region, find the resource group you created, and then click Network Settings in the Actions column. In the Data Scheduling & Data Integration area, view the Attached VPC and VSwitch. For more information about VPCs and vSwitches, see What is a VPC?.

    • Network Type: Internet NAT Gateway.

    • Elastic IP Address: Purchase New Elastic IP Address.

    • Create Associated Role: When you create a NAT Gateway for the first time, you need to create a service-linked role. Click Create Associated Role.

  2. Click Buy Now and complete the payment to create the NAT Gateway instance.

    image

  3. After you purchase the NAT Gateway instance, return to the console and create an SNAT entry for the new instance.

    Note

    The resource group can access the internet only after an SNAT entry is configured for its VPC.

    1. Click Manage in the Actions column of the new instance to go to the management page of the target NAT Gateway instance, and switch to the SNAT Management tab.

    2. Under SNAT Entry List, click Create SNAT Entry. The key configuration parameters are as follows:

      • SNAT Entry Granularity: Select VPC Granularity to ensure that all resource groups within the VPC where the NAT gateway resides can access the internet through the configured elastic IP address.

      • Select Elastic IP Address: Select the elastic IP address attached to the current NAT Gateway instance.

      After you configure the SNAT entry parameters, click Confirm to create the SNAT entry.

    In the SNAT Entry List, when the Status of the newly created SNAT entry changes to Active, the VPC attached to the resource group gains internet access.

Create and attach a MaxCompute computing resource

This tutorial requires you to create a MaxCompute project and attach it as a DataWorks computing resource. This resource will be used to receive data and perform big data analysis.

  1. Go to the DataWorks workspace list page, switch to the China (Shanghai) region, find the workspace you created, and click the workspace name to go to the Workspace Details page.

  2. In the navigation pane on the left, click Computing Resources. On the computing resources page, click Attach Computing Resource and select the MaxCompute type. Configure the following key parameters to create a MaxCompute project and attach it as a DataWorks computing resource.

    Note

    Keep the default values for parameters not mentioned in the table.

    • MaxCompute Project: In the drop-down list, click Create and configure the following parameters.

      • Project Name: Custom and globally unique.

      • Billing Method For Computing Resources: Select Pay-as-you-go.

        Note

        If Pay-as-you-go cannot be selected, click Activate next to Pay-as-you-go to activate the MaxCompute service first.

      • Default Quota: Select the default existing quota from the drop-down list.

    • Default Access Identity: Select Alibaba Cloud Account.

    • Computing Resource Instance Name: When tasks run later, this name is used to select the computing resource for the task, so use a name that is easy to identify. This tutorial uses MaxCompute_Source.

  3. Click Confirm.

Procedure

This section uses the following scenario as an example to guide you through the features of DataWorks:

Assume an E-commerce platform stores product and order information in a MySQL database. You need to periodically analyze the order data and view a daily ranking of the best-selling product categories in a visual format.

1. Data synchronization

Create a data source

DataWorks connects to data sources and destinations using data source connections. In this step, you need to create a MySQL data source to connect to the source MySQL database that stores the business data. This data source provides the raw data for this tutorial.

Note

You do not need to prepare the raw business data for this tutorial. For testing and learning purposes, DataWorks provides a test dataset. The table data is stored in a public MySQL database that you can connect to by creating a MySQL data source.

  1. Go to the DataWorks Management Center page, switch to the China (Shanghai) region, select the workspace you created from the drop-down list, and then click Go To Management Center.

  2. In the navigation pane on the left, click Data Source to go to the Data Source List page. Click Add Data Source, select the MySQL type, and configure the MySQL data source parameters.

    Note
    • Keep the default values for parameters not mentioned in the table.

    • The first time you add a data source, you are prompted to complete Cross-service Authorization. Follow the on-screen instructions to grant the AliyunDIDefaultRole service-linked role.

    • Data Source Name: This example uses MySQL_Source.

    • Configuration Mode: Select Connection String Mode.

    • Connection Address:

      • Host IP: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com

      • Port: 3306

      Important

      The data provided in this tutorial is for hands-on practice with DataWorks only. All data is test data and can only be read in the Data Integration module.

    • Database Name: Set to retail_e_commerce.

    • Username: Enter the username workshop.

    • Password: Enter the password workshop#2017.

  3. In the Connection Configuration area, switch to the Data Integration tab, find the resource group attached to the workspace, and click Test Connectivity in the Connectivity Status column.

    Note

    If the MySQL data source connectivity test fails, perform the following steps:

    • Follow the instructions in the connectivity diagnosis tool.

    • Check whether an EIP is configured for the VPC that is attached to the resource group. The MySQL data source requires the resource group to have internet access. For more information, see Enable internet access for the resource group.

  4. Click Complete.

Build a synchronization pipeline

In this step, you build a synchronization pipeline to sync E-commerce product and order data to MaxCompute tables. This prepares the data for subsequent processing.

  1. Click the icon in the upper-left corner and choose All Products > Data Development And O&M > DataStudio (Data Development) to go to the data development page.

  2. At the top of the page, switch to the workspace created for this tutorial. In the navigation pane on the left, click the icon to go to the Project Directory page.

  3. In the Project Directory area, click the icon, select New Workflow, and set the workflow name. This tutorial sets it to dw_quickstart.

  4. On the workflow orchestration page, drag one Zero Load Node and two Offline Synchronization nodes from the left onto the canvas, and then set their names.

    The example node names and their functions in this tutorial are as follows:

    • Zero load node (virtual node) workshop: Used to manage the entire workflow, making the data forwarding path clearer. This node is a Dry-run Task and does not require code editing.

    • Offline synchronization node ods_item_info: Used to synchronize the product information source table item_info from MySQL to the MaxCompute table ods_item_info.

    • Offline synchronization node ods_trade_order: Used to synchronize the order information source table trade_order from MySQL to the MaxCompute table ods_trade_order.

    Manually connect the nodes by dragging lines between them, and set the workshop node as the ancestor node of the two offline synchronization nodes.
  5. Workflow scheduling configuration.

    In the right-side panel of the workflow orchestration page, click Scheduling Configuration and configure the parameters. This tutorial requires you to configure the following key parameters. You can keep the default values for all other parameters.

    • Scheduling Parameters: Set scheduling parameters for the entire workflow. Inner nodes in the workflow can use them directly. This tutorial configures bizdate=$[yyyymmdd-1] to get the previous day's date. A minimal usage example follows this list.

      Note

      DataWorks provides scheduling parameters that allow for dynamic parameter input in code. You can define variables in your SQL code in the format ${variable_name} and assign values to these variables in Scheduling Configuration > Scheduling Parameters. For more information about supported formats for scheduling parameters, see Supported formats for scheduling parameters.

    • Scheduling Cycle: This tutorial configures it as Daily.

    • Scheduling Time: This tutorial sets the Scheduling Time to 00:30. The workflow starts at 00:30 every day.

    • Recurring Dependency: The workflow has no upstream dependency, so this can be left unconfigured. For unified management, you can click Use Workspace Root Node to attach the workflow to the workspace root node. The workspace root node is named in the format workspace_name_root.
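    The following minimal MaxCompute SQL sketch shows how an inner node can use the bizdate parameter defined above. The table name demo_table is a hypothetical placeholder; the actual nodes in this tutorial apply the same pattern to their own tables.

      -- Minimal sketch: demo_table is a hypothetical placeholder table.
      -- At run time, DataWorks replaces ${bizdate} with the value produced by $[yyyymmdd-1],
      -- for example 20250416, so the query reads the previous day's partition.
      SELECT COUNT(*) AS row_cnt
      FROM demo_table
      WHERE pt = '${bizdate}';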

Configure synchronization tasks

Configure the initial node
  1. On the workflow orchestration page, hover your mouse over the workshop node and click Open Node.

  2. In the right-side panel of the workshop node editing page, click Scheduling Configuration and configure the parameters. This tutorial requires you to configure the following key parameters. You can keep the default values for all other parameters.

    • Scheduling Type: This tutorial sets it to Dry-run Scheduling.

    • Scheduling Resource Group: This tutorial sets it to the Serverless resource group created in Create a resource group and attach it to a workspace.

    • Node Dependency Configuration: Because workshop is the initial node and has no upstream dependency, you can click Use Workspace Root Node to have the workspace root node trigger the workflow execution. The workspace root node is named workspace_name_root.

  3. In the node toolbar, click Save to save the node.

Configure the product information synchronization pipeline (ods_item_info)
  1. On the workflow orchestration page, hover your mouse over the ods_item_info node and click Open Node.

  2. Configure the network and resource settings for the synchronization pipeline.

    • Data Source: Set Data Source to MySQL and Data Source Name to MySQL_Source.

    • My Resource Group: Select the Serverless resource group purchased in Create a resource group and attach it to a workspace.

    • Data Destination: Select MaxCompute and the computing resource attached in Create and attach a MaxCompute computing resource (MaxCompute_Source in this tutorial).

  3. Click Next to configure the synchronization task.

    1. Configure Data Source And Destination

      Note

      Keep the default values for parameters not mentioned in the table.

      • Data Source > Table: item_info.

      • Data Destination > Table: Click One-click Generate Target Table Schema to quickly create a MaxCompute table. Paste the following table creation statement into the Table Creation Statement area and click Create Table. This table is used to receive product information from the data source.

      Table Creation SQL

      CREATE TABLE IF NOT EXISTS ods_item_info(
      `id`                            BIGINT COMMENT '',
      `cate_id`                       BIGINT COMMENT '',
      `cate_name`                     STRING COMMENT '',
      `commodity_id`                  BIGINT COMMENT '',
      `commodity_name`                STRING COMMENT '',
      `desc_path`                     STRING COMMENT '',
      `duration`                      BIGINT COMMENT '',
      `features`                      STRING COMMENT '',
      `gmt_create`                    DATETIME COMMENT '',
      `gmt_modified`                  DATETIME COMMENT '',
      `is_deleted`                    BIGINT COMMENT '',
      `is_virtual`                    STRING COMMENT '',
      `item_id`                       BIGINT COMMENT '',
      `item_status`                   BIGINT COMMENT '',
      `last_offline_time`             DATETIME COMMENT '',
      `last_online_quantity`          BIGINT COMMENT '',
      `last_online_time`              DATETIME COMMENT '',
      `pict_url`                      STRING COMMENT '',
      `reserve_price`                 DECIMAL(38,18) COMMENT '',
      `secure_trade_ems_post_fee`     DECIMAL(38,18) COMMENT '',
      `secure_trade_fast_post_fee`    DECIMAL(38,18) COMMENT '',
      `secure_trade_ordinary_post_fee`DECIMAL(38,18) COMMENT '',
      `shop_id`                       BIGINT COMMENT '',
      `shop_nick`                     STRING COMMENT '',
      `sub_title`                     STRING COMMENT '',
      `title`                         STRING COMMENT ''
      )
      COMMENT ''
      PARTITIONED BY (pt STRING) 
      lifecycle 36500;

      • Partition Information: This tutorial enters ${bizdate}. This is used to assign a constant value to the bizdate parameter during the testing phase and to dynamically assign a value to the bizdate parameter during scheduled execution. For more information about the variable formats and configuration methods supported by DataStudio, see Scheduling parameters.

    2. Confirm Field Mapping and Channel Control.

      DataWorks writes data from source fields to destination fields based on the configured field mappings. You can also configure settings such as task concurrency and dirty data policies. In this tutorial, set the Dirty Data Policy to Do Not Tolerate Dirty Data. You can keep the default values for all other settings. For more information, see Configure an offline sync task in the codeless UI.

  4. In the node toolbar, click Save to save the node.

Configure the order data synchronization pipeline (ods_trade_order)
  1. On the workflow orchestration page, hover your mouse over the ods_trade_order node and click Open Node.

  2. Configure the network and resource settings for the synchronization pipeline.

    • Data Source: Set Data Source to MySQL and Data Source Name to MySQL_Source.

    • My Resource Group: Select the Serverless resource group purchased in Create a resource group and attach it to a workspace.

    • Data Destination: Select MaxCompute and the computing resource attached in Create and attach a MaxCompute computing resource (MaxCompute_Source in this tutorial).

  3. Click Next to configure the synchronization task.

    1. Configure Data Source And Destination

      Note

      Keep the default values for parameters not mentioned in the table.

      • Data Source > Table: trade_order.

      • Data Destination > Table: Click One-click Generate Target Table Schema to quickly create a MaxCompute table. Paste the following table creation statement into the Table Creation Statement area and click Create Table. This table is used to receive order information from the data source.

      Table Creation SQL

      CREATE TABLE IF NOT EXISTS ods_trade_order(
      `id`                            BIGINT COMMENT '',
      `biz_type`                      BIGINT COMMENT '',
      `buy_amount`                    BIGINT COMMENT '',
      `buyer_id`                      BIGINT COMMENT '',
      `buyer_memo`                    STRING COMMENT '',
      `buyer_nick`                    STRING COMMENT '',
      `end_time`                      DATETIME COMMENT '',
      `gmt_create`                    DATETIME COMMENT '',
      `gmt_modified`                  DATETIME COMMENT '',
      `ip`                            STRING COMMENT '',
      `is_parent`                     BIGINT COMMENT '',
      `is_sub`                        BIGINT COMMENT '',
      `item_id`                       BIGINT COMMENT '',
      `item_price`                    DECIMAL(38,18) COMMENT '',
      `logistics_status`              BIGINT COMMENT '',
      `memo`                          STRING COMMENT '',
      `parent_order_id`               BIGINT COMMENT '',
      `pay_status`                    BIGINT COMMENT '',
      `pay_time`                      DATETIME COMMENT '',
      `seller_memo`                   STRING COMMENT '',
      `shop_id`                       BIGINT COMMENT '',
      `status`                        BIGINT COMMENT '',
      `sub_order_id`                  BIGINT COMMENT '',
      `total_fee`                     DECIMAL(38,18) COMMENT ''
      )
      COMMENT ''
      PARTITIONED BY (pt STRING) 
      lifecycle 36500;

      • Partition Information: This tutorial enters ${bizdate}. This is used to assign a constant value to the bizdate parameter during the testing phase and to dynamically assign a value to the bizdate parameter during scheduled execution. For more information about the variable formats and configuration methods supported by DataStudio, see Scheduling parameters.

    2. Confirm Field Mapping and Channel Control.

      DataWorks writes data from source fields to destination fields based on the configured field mappings. You can also configure settings such as task concurrency and dirty data policies. In this tutorial, set the Dirty Data Policy to Do Not Tolerate Dirty Data. You can keep the default values for all other settings. For more information, see Configure an offline sync task in the codeless UI.

  4. In the node toolbar, click Save to save the node.

2. Data cleansing

After the data is synchronized from MySQL to MaxCompute, you have two data tables: the product information table ods_item_info and the order information table ods_trade_order. You can use the DataStudio module in DataWorks to cleanse, process, and analyze the data in these tables to generate a daily ranking of the best-selling product categories.

Build a data transformation pipeline

  1. In the navigation pane on the left of DataStudio, click the icon to go to the data development page. Then, in the Project Directory area, find the workflow that you created, open the workflow orchestration page, drag MaxCompute SQL nodes from the component list on the left onto the canvas, and name the nodes.

    The example node names and their functions in this tutorial are as follows:

    • MaxCompute SQL node dim_item_info: Based on the ods_item_info table, processes product dimension data to produce the product basic information dimension table dim_item_info.

    • MaxCompute SQL node dwd_trade_order: Based on the ods_trade_order table, performs initial cleansing, transformation, and business logic processing on detailed transaction data of orders to produce the transaction order detail fact table dwd_trade_order.

    • MaxCompute SQL node dws_daily_category_sales: Based on the dwd_trade_order table and the dim_item_info table, aggregates the cleansed and standardized detailed data from the DWD layer to produce the daily product category sales aggregate table dws_daily_category_sales.

    • MaxCompute SQL node ads_top_selling_categories: Based on the dws_daily_category_sales table, produces the daily ranking table of best-selling product categories, ads_top_selling_categories.

  2. Manually connect the nodes by dragging lines between them and configure the ancestor node for each node.
    Note

    In a workflow, you can set the upstream and downstream dependencies of each node by manually connecting them. You can also use code parsing in a child node to automatically identify node dependencies. This tutorial uses the manual connection method. For more information about code parsing, see Automatic dependency parsing.

Configure data transformation nodes

Configure the dim_item_info node

This node processes product dimension data from the ods_item_info table to produce the product dimension table dim_item_info.

  1. On the workflow orchestration page, hover your mouse over the dim_item_info node and click Open Node.

  2. Paste the following code into the node editing page.

    CREATE TABLE IF NOT EXISTS dim_item_info (
        gmt_modified                   STRING COMMENT 'Product last modified date',
        gmt_create                     STRING COMMENT 'Product creation time',
        item_id                        BIGINT COMMENT 'Product numeric ID',
        title                          STRING COMMENT 'Product title',
        sub_title                      STRING COMMENT 'Product subtitle',
        pict_url                       STRING COMMENT 'Main image URL',
        desc_path                      STRING COMMENT 'Path of product description',
        item_status                    BIGINT COMMENT 'Product status 1: Confirmed 0: Not confirmed',
        last_online_time               DATETIME COMMENT 'Last time sales started, product listing time',
        last_offline_time              DATETIME COMMENT 'Sales end time, indicates the end of a sales cycle, only for auction items',
        duration                       BIGINT COMMENT 'Validity period, sales cycle, only two values, 7 days or 14 days',
        reserve_price                  DOUBLE COMMENT 'Current price',
        secure_trade_ordinary_post_fee DOUBLE COMMENT 'Standard mail fee',
        secure_trade_fast_post_fee     DOUBLE COMMENT 'Express mail fee',
        secure_trade_ems_post_fee      DOUBLE COMMENT 'EMS mail fee',
        last_online_quantity           BIGINT COMMENT 'Inventory quantity when the product was last listed',
        features                       STRING COMMENT 'Product features',
        cate_id                        BIGINT COMMENT 'Product leaf category ID',
        cate_name                      STRING COMMENT 'Product leaf category name',
        commodity_id                   BIGINT COMMENT 'Category ID',
        commodity_name                 STRING COMMENT 'Category name',
        is_virtual                     STRING COMMENT 'Whether it is a virtual product',
        shop_id                        BIGINT COMMENT 'Shop ID',
        shop_nick                      STRING COMMENT 'Shop NICK',
        is_deleted                     BIGINT COMMENT 'Whether the category is deleted'
    )
    COMMENT 'Product basic information dimension table'
    PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd')
    LIFECYCLE 365;
    
    
    -- Insert data into the dim_item_info table
    INSERT OVERWRITE TABLE dim_item_info PARTITION(pt='${bizdate}')
    SELECT
        gmt_modified,   -- keep the same column order as dim_item_info (gmt_modified, then gmt_create)
        gmt_create,
        item_id,
        title,
        sub_title,
        pict_url,
        desc_path,
        item_status,
        last_online_time,
        last_offline_time,
        duration,
        cast(reserve_price as DOUBLE),
        cast(secure_trade_ordinary_post_fee as DOUBLE),
        cast(secure_trade_fast_post_fee as DOUBLE),
        cast(secure_trade_ems_post_fee as DOUBLE),
        last_online_quantity,
        features,
        cate_id,
        cate_name,
        commodity_id,
        commodity_name,
        is_virtual,
        shop_id,
        shop_nick,
        is_deleted
    FROM ods_item_info
    WHERE pt = '${bizdate}';
  3. Configure the debug parameters.

    In the right-side panel of the MaxCompute SQL node editing page, click Debug Configuration and select the MaxCompute computing resource (MaxCompute_Source in this tutorial) and the resource group to use for test runs.

  4. In the node toolbar, click Save to save the node.

Configure the dwd_trade_order node

This node performs initial cleansing, transformation, and business logic processing on detailed transaction data from the ods_trade_order table to produce the transaction order detail fact table dwd_trade_order.

  1. On the workflow orchestration page, hover your mouse over the dwd_trade_order node and click Open Node.

  2. Paste the following code into the node editing page.

    CREATE TABLE IF NOT EXISTS dwd_trade_order (
        id               BIGINT COMMENT 'Primary key, latest ID after deduplication',
        gmt_create       DATETIME COMMENT 'Creation time',
        gmt_modified     DATETIME COMMENT 'Modification time',
        sub_order_id     BIGINT COMMENT 'Sub-order ID',
        parent_order_id  BIGINT COMMENT 'Parent order ID',
        buyer_id         BIGINT COMMENT 'Buyer numeric ID',
        buyer_nick       STRING COMMENT 'Buyer nickname, handles null values',
        item_id          BIGINT COMMENT 'Product numeric ID',
        item_price       DECIMAL(38,18) COMMENT 'Product price, in cents',
        buy_amount       BIGINT COMMENT 'Purchase quantity',
        biz_type         BIGINT COMMENT 'Transaction type',
        memo             STRING COMMENT 'Memo, handles null values',
        pay_status       BIGINT COMMENT 'Payment status',
        logistics_status BIGINT COMMENT 'Logistics status',
        status           BIGINT COMMENT 'Status',
        seller_memo      STRING COMMENT 'Seller memo for the transaction',
        buyer_memo       STRING COMMENT 'Buyer memo for the transaction',
        clean_ip         STRING COMMENT 'Cleansed buyer IP, filters invalid formats',
        end_time         DATETIME COMMENT 'Transaction end time',
        pay_time         DATETIME COMMENT 'Payment time',
        is_sub           BIGINT COMMENT 'Whether it is a sub-order, 1 indicates sub-order',
        is_parent        BIGINT COMMENT 'Whether it is a parent order, 1 indicates parent order',
        shop_id          BIGINT COMMENT 'Shop ID',
        total_fee        DECIMAL(38,18) COMMENT 'Sub-order fee after discounts and adjustments',
        is_large_order_flag BOOLEAN COMMENT 'Flag for large orders'
    )
    COMMENT 'Transaction order detail fact table, includes initial cleansing and business logic processing'
    PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd')
    LIFECYCLE 365; -- Data time to live set to 365 days
    
    
    INSERT OVERWRITE TABLE dwd_trade_order PARTITION(pt='${bizdate}')
    SELECT
        MAX(id) AS id, -- Assume using the latest ID as the deduplication standard
        gmt_create,
        gmt_modified,
        sub_order_id,
        parent_order_id,
        buyer_id,
        COALESCE(buyer_nick, '') AS buyer_nick, -- Handle null buyer_nick
        item_id,
        item_price,
        buy_amount,
        biz_type,
        COALESCE(memo, '') AS memo, -- Handle null memo
        pay_status,
        logistics_status,
        status,
        seller_memo,
        buyer_memo,
        CASE 
            WHEN ip LIKE '__.__.__.__' THEN NULL -- Filter invalid IP formats
            ELSE ip 
        END AS clean_ip,
        end_time,
        pay_time,
        is_sub,
        is_parent,
        shop_id,
        total_fee,
        CASE 
            WHEN total_fee >= 10000 THEN TRUE -- Assume orders over 10000 cents are large orders
            ELSE FALSE 
        END AS is_large_order_flag -- Add business logic flag
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER(PARTITION BY buyer_id, item_id, gmt_create ORDER BY id DESC) AS rn -- Row number for deduplication
        FROM ods_trade_order
        WHERE pt = '${bizdate}'
    ) AS sub_query
    WHERE rn = 1 -- Keep only the first record of each deduplication group
    GROUP BY 
        gmt_create,
        gmt_modified,
        sub_order_id,
        parent_order_id,
        buyer_id,
        buyer_nick,
        item_id,
        item_price,
        buy_amount,
        biz_type,
        memo,
        pay_status,
        logistics_status,
        status,
        seller_memo,
        buyer_memo,
        clean_ip,
        end_time,
        pay_time,
        is_sub,
        is_parent,
        shop_id,
        total_fee,
        is_large_order_flag;
  3. Configure the debug parameters.

    In the right-side panel of the MaxCompute SQL node editing page, click Debug Configuration and select the MaxCompute computing resource (MaxCompute_Source in this tutorial) and the resource group to use for test runs.

  4. In the node toolbar, click Save to save the node.

Configure the dws_daily_category_sales node

This node aggregates the cleansed and standardized detailed data from the dwd_trade_order table and the dim_item_info table to produce the daily product category sales aggregate table dws_daily_category_sales.

  1. On the workflow orchestration page, hover your mouse over the dws_daily_category_sales node and click Open Node.

  2. Paste the following code into the node editing page.

    CREATE TABLE IF NOT EXISTS dws_daily_category_sales (
        cate_id             BIGINT COMMENT 'Product leaf category ID',
        cate_name           STRING COMMENT 'Product leaf category name',
        total_sales_amount  DECIMAL(38,18) COMMENT 'Total sales amount for the product category, in cents',
        order_count         BIGINT COMMENT 'Number of orders'
    )
    COMMENT 'Daily product category sales aggregate table'
    PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd')
    LIFECYCLE 365;
    
    
    INSERT OVERWRITE TABLE dws_daily_category_sales PARTITION(pt='${bizdate}')
    SELECT
        i.cate_id,
        i.cate_name,
        SUM(t.total_fee) AS total_sales_amount,
        COUNT(DISTINCT t.id) AS order_count
    FROM dwd_trade_order t
    JOIN dim_item_info i ON t.item_id = i.item_id AND t.pt = i.pt
    WHERE t.pt = '${bizdate}'
    GROUP BY t.pt, i.cate_id, i.cate_name;
  3. Configure the debug parameters.

    In the right-side panel of the MaxCompute SQL node editing page, click Debug Configuration and select the MaxCompute computing resource (MaxCompute_Source in this tutorial) and the resource group to use for test runs.

  4. In the node toolbar, click Save to save the node.

Configure the ads_top_selling_categories node

This node produces the daily ranking table of best-selling product categories, ads_top_selling_categories, based on the dws_daily_category_sales table.

  1. On the workflow orchestration page, hover your mouse over the ads_top_selling_categories node and click Open Node.

  2. Paste the following code into the node editing page.

    CREATE TABLE IF NOT EXISTS ads_top_selling_categories (
        rank                BIGINT COMMENT 'Sales rank',
        cate_id             BIGINT COMMENT 'Product leaf category ID',
        cate_name           STRING COMMENT 'Product leaf category name',
        total_sales_amount  DECIMAL(38,18) COMMENT 'Total sales amount for the product category, in cents',
        order_count         BIGINT COMMENT 'Number of orders'
    )
    COMMENT 'Daily ranking table of best-selling product categories'
    PARTITIONED BY (pt STRING COMMENT 'Data timestamp, yyyymmdd');
    
    
    INSERT OVERWRITE TABLE ads_top_selling_categories PARTITION(pt='${bizdate}')
    SELECT
        rank,
        cate_id,
        cate_name,
        total_sales_amount,
        order_count
    FROM (
        SELECT
            DENSE_RANK() OVER(ORDER BY total_sales_amount DESC) AS rank,
            cate_id,
            cate_name,
            total_sales_amount,
            order_count
        FROM (
            SELECT
                cate_id,
                cate_name,
                SUM(total_sales_amount) AS total_sales_amount,
                SUM(order_count) AS order_count
            FROM dws_daily_category_sales
            WHERE pt = '${bizdate}'
            GROUP BY cate_id, cate_name
        ) agg_sub
    ) agg_outer
    WHERE rank <= 10;
  3. Configure the debug parameters.

    In the right-side panel of the MaxCompute SQL node editing page, click Debug Configuration and select the MaxCompute computing resource (MaxCompute_Source in this tutorial) and the resource group to use for test runs.

  4. In the node toolbar, click Save to save the node.

3. Debug and run

After you configure the workflow, you must run the entire workflow to verify that the configuration is correct before you publish it to the production environment.

  1. In the navigation pane on the left of DataStudio, click image to go to the data development page. Then, in the Project Directory area, find the workflow you created.

  2. In the toolbar, click Run. For Current Run Value, enter the date for the day before the current date (for example, 20250416).

    Note

    In the workflow node configuration, scheduling parameters provided by DataWorks are used to enable dynamic parameter input in the code. When you debug the node, you must assign a constant value to this parameter for testing.

  3. Click OK to go to the debug run page.

  4. Wait for the run to complete and confirm that all nodes in the workflow run successfully. You can then check the data that was written with an ad hoc query, as shown in the sketch after this list.
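The following sketch counts the rows that the debug run wrote into the source and result tables. It assumes that you entered 20250416 as the run value; replace the partition value with the date that you used. You can run it as an ad hoc query or in the SQL Query module described in the next section.

    -- Sketch: row counts for the partition written by the debug run.
    -- The partition value 20250416 is an assumed example run value.
    SELECT 'ods_item_info' AS table_name, COUNT(*) AS row_cnt FROM ods_item_info WHERE pt = '20250416'
    UNION ALL
    SELECT 'ods_trade_order' AS table_name, COUNT(*) AS row_cnt FROM ods_trade_order WHERE pt = '20250416'
    UNION ALL
    SELECT 'ads_top_selling_categories' AS table_name, COUNT(*) AS row_cnt FROM ads_top_selling_categories WHERE pt = '20250416';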

4. Query and visualize data

You have processed the raw test data from MySQL and aggregated it into the ads_top_selling_categories table. You can now query this table to view the analysis results.

  1. Click the icon in the upper-left corner. On the page that appears, choose All Products > Data Analysis > SQL Query.

  2. Next to My Files, click the icon and select New File. Enter a custom File Name and click OK.

  3. On the SQL Query page, configure the following SQL. If you prefer to query a specific partition directly, see the variant after this list.

    SELECT * FROM ads_top_selling_categories WHERE pt='${bizdate}';
  4. In the upper-right corner, select the MaxCompute data source and click OK.

  5. Click the Run button at the top. On the Cost Estimation page, click Run.

  6. In the query results, click the chart icon to view the results as a chart. You can click the icon in the upper-right corner of the chart to customize the chart style.

  7. You can also click Save in the upper-right corner of the chart to save it as a card. Then, in the navigation pane on the left, you can click Card to view the saved card.
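
If you prefer to query a specific partition directly instead of assigning a value to the ${bizdate} variable, you can hard-code the data timestamp. The following sketch assumes a data timestamp of 20250416; replace it with the partition that your workflow actually wrote.

    -- Sketch: top 10 best-selling categories for an assumed data timestamp of 20250416.
    SELECT rank, cate_name, total_sales_amount, order_count
    FROM ads_top_selling_categories
    WHERE pt = '20250416'
    ORDER BY rank
    LIMIT 10;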

5. Recurring scheduling

By completing the previous steps, you have obtained the sales data for various product categories from the previous day. To retrieve the latest sales data every day, you can publish the workflow to the production environment to run it periodically at a scheduled time.

Note

When you configured the data synchronization and data transformation tasks, you also configured the scheduling parameters for the workflow and its nodes. You do not need to configure them again. You only need to publish the workflow to the production environment. For more information about scheduling configurations, see Node scheduling configuration.

  1. Click the icon in the upper-left corner. On the page that appears, choose All Products > Data Development And O&M > DataStudio (Data Development).

  2. In the navigation pane on the left of DataStudio, click the icon to go to the data development page. Switch to the workspace used in this tutorial. Then, in the Project Directory area, find the workflow you created.

  3. In the toolbar, click Publish. In the publish panel, click Start Publishing To Production. Wait for the Deployment Package Build and Production Checker steps to complete, and then click Confirm Publish.

  4. After the Publish To Production Environment status changes to Completed, click Go To O&M to go to the Operation Center.


  5. In Auto Triggered Task O&M > Auto Triggered Task, you can see the auto-triggered task for the workflow (named dw_quickstart in this tutorial).

  6. To view the details of the auto-triggered tasks for the child nodes in the workflow, right-click the auto-triggered task for the workflow and select View Inner Tasks.


What to do next

Appendix: Release and clean up resources

To release the resources created in this tutorial, perform the following steps:

  1. Stop the auto-triggered tasks.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Operation Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Operation Center.

    2. On the Auto Triggered Task O&M > Auto Triggered Task page, select all the auto-triggered tasks that you created. The workspace root node does not need to be unpublished. Then, at the bottom of the page, click Operation > Unpublish Node.

  2. Delete the data development nodes and detach the MaxCompute computing resource.

    1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

    2. In the navigation pane on the left of DataStudio, click the icon to go to the data development page. Then, in the Project Directory area, find the workflow you created, right-click the workflow, and click Delete.

    3. In the navigation pane on the left, click the icon and choose Computing Resource Management, find the attached MaxCompute computing resource, and click Detach. In the confirmation window, select the checkbox and follow the on-screen instructions to complete the detachment.

  3. Delete the MySQL data source.

    1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the navigation pane on the left, click Data Source to go to the Data Source List page. Find the MySQL data source you created, click Delete in the Actions column, and follow the instructions to complete the deletion.

  4. Delete the MaxCompute project.

    Go to the MaxCompute Project Management page, find the MaxCompute project that you created, click Delete in the Actions column, and follow the on-screen instructions to complete the deletion. If you want to keep the project and remove only the tables created in this tutorial, see the sketch at the end of this topic instead.

  5. Delete the Internet NAT gateway and release the elastic IP address.

    1. Go to the VPC - Internet NAT Gateway console. In the top menu bar, switch to the China (Shanghai) region.

    2. Find the Internet NAT gateway that you created. In the Actions column, click the icon and then click Delete. In the confirmation window, select Force Delete and then click OK.

    3. In the navigation pane on the left, click Internet Access > Elastic IP Address. Find the EIP that you created. In the Actions column, click the icon and choose Instance Management > Release. In the confirmation window, click OK.
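
If you want to keep the MaxCompute project and only remove the tables created in this tutorial, you can skip deleting the project and instead run the following statements in an ad hoc query. This is a sketch that covers only the tables created in this tutorial.

    -- Drop only the tables created in this tutorial; the MaxCompute project itself is kept.
    DROP TABLE IF EXISTS ods_item_info;
    DROP TABLE IF EXISTS ods_trade_order;
    DROP TABLE IF EXISTS dim_item_info;
    DROP TABLE IF EXISTS dwd_trade_order;
    DROP TABLE IF EXISTS dws_daily_category_sales;
    DROP TABLE IF EXISTS ads_top_selling_categories;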