This tutorial covers data ingestion, processing, scheduling, and visualization using core DataWorks features.
Introduction
This tutorial demonstrates how to build a data pipeline—from raw data ingestion to analysis and visualization—using an e-commerce scenario. A standardized process helps you quickly build reusable data flows with reliable scheduling and observability. This lowers the barrier for big data applications, enabling business users to extract value without managing technical details.
You will perform the following tasks:
Data synchronization: Create a batch synchronization task in Data Integration to move business data to a compute platform like MaxCompute.
Data cleaning: Clean, analyze, and mine data in Data Studio.
Data visualization: Visualize analysis results in Data Analysis for easier business interpretation.
Scheduling: Schedule synchronization and cleaning tasks to run automatically.

You will synchronize raw product and order data from a public source to MaxCompute, then analyze it to generate a daily ranking of best-selling categories.
Prerequisites
Use an Alibaba Cloud account or a RAM user with the AliyunDataWorksFullAccess permission. For more information, see Prepare an Alibaba Cloud account or Prepare a RAM user.
DataWorks supports granular permission control at product and module levels. For details, see Overview of the DataWorks permission management system.
Preparation
Activate DataWorks
Create a workspace
Create and associate resources
Enable public network access
Associate MaxCompute resources
Procedure
This tutorial uses the following scenario as an example to walk you through the core DataWorks features:
An e-commerce platform stores product and order data in MySQL. The goal is to analyze order data and visualize daily rankings of best-selling categories.
Step 1: Data synchronization
Create a data source
Create a MySQL data source to connect to the database hosting the sample data.
DataWorks provides a public MySQL database with sample data, so you do not need to prepare raw data yourself. You only need to create a MySQL data source to access it.
Go to the DataWorks Management Center page, switch to the Singapore region, select the created workspace from the drop-down box, and click Go to Management Center.
In the left navigation pane, click Data Sources. Click Add Data Source, select the MySQL type, and configure the MySQL data source parameters.
Note: Retain default values for parameters not listed.
First-time users must complete cross-service authorization. Follow the prompts to authorize AliyunDIDefaultRole.
Data Source Name: In this example, it is MySQL_Source.
Configuration Mode: Select Connection String Mode.
Endpoint: Set Host Address IP to rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com and Port Number to 3306. (A sample connection string is shown after this procedure.)
Important: The data provided in this tutorial is solely for practicing data applications in DataWorks. All data is test data and only supports reading in the Data Integration module.
Database Name: Set to retail_e_commerce.
Username: Enter the username workshop.
Password: Enter the password workshop#2017.
In the Connection Configuration section, switch to the Data Integration tab, find the resource group associated with the workspace, and click Test Network Connectivity in the Connectivity Status column.
Note: If the MySQL data source connectivity test fails, perform the following operations:
Follow the remaining steps suggested by the connectivity diagnostic tool.
Check whether an EIP is configured for the VPC bound to the resource group, because the MySQL data source requires the resource group to have public network access. For details, see Enable public network access.
Click Complete Creation.
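For reference, the endpoint, port, and database entered above correspond to a JDBC-style connection string along the following lines. This is shown only for orientation; in Connection String Mode, DataWorks builds the connection from the individual fields you enter:
jdbc:mysql://rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com:3306/retail_e_commerce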
Build a synchronization pipeline
In this step, you build a synchronization pipeline that synchronizes the product and order data from the e-commerce platform to tables in MaxCompute, preparing the data for subsequent processing.
Click the icon in the upper-left corner and open the data development page.
Switch to the workspace created in this tutorial at the top of the page, and click the icon in the left navigation pane to enter the Workspace Directories page.
In the Workspace Directories area, click the icon, select Create Workflow, and set the workflow name. In this tutorial, it is set to dw_quickstart.
On the workflow orchestration page, drag one Zero Load node and two Batch Synchronization nodes from the left side onto the canvas, and set the node names.
The node names and functions are described below:
Zero Load node workshop: Used to manage the entire workflow, making the data flow path clearer. This node is a dry-run task and does not require code editing.
Batch Synchronization node ods_item_info: Synchronizes the product information source table item_info stored in MySQL to the ods_item_info table in MaxCompute.
Batch Synchronization node ods_trade_order: Synchronizes the order information source table trade_order stored in MySQL to the ods_trade_order table in MaxCompute. (A rough sketch of the two target tables follows below.)
Manually drag lines to set the workshop node as the upstream node of the two batch synchronization nodes.
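For orientation, the two MaxCompute target tables are typically partitioned tables created during synchronization task configuration. The DDL below is only a rough sketch; the column names and types are assumptions for illustration, not the tutorial's exact schema:
-- Illustrative only: column names and types are assumed, not the tutorial's actual schema.
CREATE TABLE IF NOT EXISTS ods_item_info (
    item_id        STRING COMMENT 'Product ID',
    item_title     STRING COMMENT 'Product title',
    category_name  STRING COMMENT 'Product category'
)
PARTITIONED BY (pt STRING COMMENT 'Business date, for example 20250416');

CREATE TABLE IF NOT EXISTS ods_trade_order (
    order_id    STRING COMMENT 'Order ID',
    item_id     STRING COMMENT 'Product ID',
    pay_amount  DOUBLE COMMENT 'Payment amount'
)
PARTITIONED BY (pt STRING COMMENT 'Business date');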
Workflow scheduling configuration.
Click Scheduling on the right side of the workflow orchestration page to configure the relevant parameters. The following are the key parameters required for this tutorial. Retain default values for parameters not listed.
Scheduling Parameters: Set scheduling parameters for the entire workflow; nodes inside the workflow can use them directly. In this tutorial, configure bizdate=$[yyyymmdd-1] to obtain the date of the previous day (see the example after this list).
Note: DataWorks provides scheduling parameters to enable dynamic code input. You can define variables in SQL code using the ${Variable Name} format and assign values to these variables in Scheduling > Scheduling Parameters. For details on supported formats for scheduling parameters, see Supported formats for scheduling parameters.
Scheduling Cycle: In this tutorial, set it to Day.
Scheduling Time: In this tutorial, set it to 00:30. The workflow will start at 00:30 every day.
Scheduling Dependencies: The workflow has no upstream dependency, so this can be left unconfigured. For easier unified management, you can click Use Workspace Root Node to mount the workflow under the workspace root node. The naming format of the workspace root node is WorkspaceName_root.
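As a hedged illustration of how such a variable is typically referenced inside a node, the snippet below shows only the ${bizdate} pattern; the table and column names are placeholders, not the tutorial's actual node code:
-- Illustrative only: demonstrates the ${bizdate} reference pattern, not the tutorial's actual code.
INSERT OVERWRITE TABLE dwd_trade_order PARTITION (pt = '${bizdate}')
SELECT  order_id,
        item_id,
        pay_amount
FROM    ods_trade_order
WHERE   pt = '${bizdate}';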
Configure the synchronization task
Configure initial node
Configure product info pipeline
Configure order data pipeline
Step 2: Data cleaning
After the data is synchronized from MySQL to MaxCompute, you have two tables: the product information table ods_item_info and the order information table ods_trade_order. You can now clean, process, and analyze this data in the DataStudio module of DataWorks to obtain the daily ranking of best-selling product categories.
Build a data processing pipeline
In the left navigation pane of DataStudio, click the icon to enter the data development page. Then, in the Workspace Directories area, find the created workflow and click it to enter the workflow orchestration page. Drag MaxCompute SQL nodes from the left side onto the canvas and set the node names.
The node names and functions are described below:
MaxCompute SQL node dim_item_info: Processes product dimension data based on the ods_item_info table to produce the product basic information dimension table dim_item_info.
MaxCompute SQL node dwd_trade_order: Performs initial cleaning, transformation, and business logic processing on detailed order transaction data based on the ods_trade_order table to produce the transaction order detail fact table dwd_trade_order.
MaxCompute SQL node dws_daily_category_sales: Aggregates the cleaned and standardized detail data from the DWD layer based on the dwd_trade_order and dim_item_info tables to produce the daily product category sales summary table dws_daily_category_sales.
MaxCompute SQL node ads_top_selling_categories: Produces the daily best-selling product category ranking table ads_top_selling_categories based on the dws_daily_category_sales table (an illustrative sketch of this pattern is shown after the note below).
Manually drag lines to configure the upstream nodes for each node. The final effect is as follows:
Note: The workflow supports setting upstream and downstream dependencies for each node via manual connections. It also supports using code parsing within child nodes to automatically identify node dependencies. This tutorial uses the manual connection method. For more information about code parsing, see Automatic dependency parsing.
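The exact SQL for each node is provided in the configuration sections below. Purely as an illustration of the layered pattern, the final ranking node could resemble the following sketch; the column names are assumptions, not the tutorial's actual code:
-- Illustrative sketch only: column names are assumed, not the tutorial's actual node code.
INSERT OVERWRITE TABLE ads_top_selling_categories PARTITION (pt = '${bizdate}')
SELECT  category_name,
        total_sales,
        ROW_NUMBER() OVER (ORDER BY total_sales DESC) AS sale_rank
FROM    dws_daily_category_sales
WHERE   pt = '${bizdate}';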
Configure data processing nodes
Configure dim_item_info
Configure dwd_trade_order
Configure dws_daily_category_sales
Configure ads_top_selling_categories
Step 3: Debug and run
After the workflow configuration is complete, you need to run the entire workflow to verify the correctness of the configuration before deploying it to the production environment.
In the left navigation pane of DataStudio, click the icon to enter the data development page. Then, in the Workspace Directories area, find the created workflow.
Click Run in the node toolbar, and fill in Value Used in This Run with the date of the previous day (for example, 20250416).
Note: The workflow node configuration uses DataWorks scheduling parameters to implement dynamic code input. During debugging, you need to assign a constant value to this parameter for testing.
Click OK to enter the debug running page.
Wait for the run to complete. The expected result is as follows:

Step 4: Data query and visualization
You have processed the raw test data obtained from MySQL through data development and aggregated it into the table ads_top_selling_categories. Now you can query the table data to view the data analysis results.
Click the icon in the upper-left corner and, on the pop-up page, go to SQL Query.
Click the icon next to My Files, customize the File Name, and click OK.
On the SQL Query page, enter the following SQL (a narrower variant is shown after this procedure):
SELECT * FROM ads_top_selling_categories WHERE pt=${bizdate};
Select the MaxCompute data source in the upper-right corner and click OK.
Click the Run button at the top, and click Run on the Cost Estimation page.
Click the icon in the query results to view the visualized chart. You can click the icon in the upper-right corner of the chart to customize the chart style. You can also click Save in the upper-right corner of the chart to save it as a card, and then click Card in the left navigation pane to view the card.
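If you only want the top few categories for charting, you can narrow the query above. The column names here (category_name, total_sales, sale_rank) are assumptions based on the table's purpose; adjust them to match the actual schema:
-- Illustrative only: adjust column names to the actual ads_top_selling_categories schema.
SELECT  category_name,
        total_sales,
        sale_rank
FROM    ads_top_selling_categories
WHERE   pt = ${bizdate}
ORDER BY sale_rank
LIMIT   10;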
Step 5: Periodic scheduling
By completing the previous steps, you have obtained the previous day's sales data for each product category. To obtain the latest sales data every day, you can deploy the workflow to the production environment so that it runs periodically on the configured schedule.
Scheduling-related parameters were configured for the workflow, synchronization nodes, and data processing nodes when configuring data synchronization and data processing. You do not need to configure them again here; simply deploy the workflow to the production environment. For more detailed information about scheduling configuration, see Node scheduling configuration.
Click the icon in the upper-left corner and, on the pop-up page, return to DataStudio.
In the left navigation pane of DataStudio, click the icon to enter the data development page, switch to the workspace used in this tutorial, and then find the created workflow in the Workspace Directories area.
Click Deploy in the node toolbar. In the deployment panel, click Start Deployment to Production. Wait for Build Package and Prod Online Check to complete, and then click Deploy.
After the Prod Online status becomes Complete, click Perform O&M to go to the Operation Center.

In the Operation Center, you can see the periodic tasks of the workflow (in this tutorial, the workflow is named dw_quickstart).
To view the periodic task details of child nodes within the workflow, right-click the workflow's periodic task and select View Internal Tasks.

The expected result is as follows:

Next steps
For more operational details and parameter explanations of each module in this tutorial, see Data Integration, Data Studio (New), DataAnalysis, and Node scheduling configuration.
In addition to the modules introduced in this tutorial, DataWorks also supports multiple modules such as Data Modeling, Data Quality, Data Security Guard, and DataService Studio, providing you with one-stop data monitoring and O&M.
You can also experience more DataWorks practical tutorials. For details, see More use cases and tutorials.


