In this experiment, you can experience retail e-commerce data development and analysis scenarios based on the OpenLake House platform, use DataWorks to perform multi-engine collaborative development, and orchestrate workflows and manage data catalogs in a visualized manner. In addition, you can perform Python-based programming and debugging, and use notebooks together with AI to perform interactive data exploration and analysis.
Background information
Overview of DataWorks
DataWorks is an intelligent, lakehouse-based data development and governance platform that draws on 15 years of big data development experience at Alibaba Group. DataWorks is deeply compatible with dozens of big data and AI computing services provided by Alibaba Cloud, such as MaxCompute, E-MapReduce (EMR), Hologres, Realtime Compute for Apache Flink, and Platform for AI (PAI). DataWorks supports intelligent extract, transform, load (ETL) development, data analysis, and proactive data asset governance for data warehouses, data lakes, and the OpenLake lakehouse architecture to facilitate data management throughout the Data+AI lifecycle. Since 2009, DataWorks has continuously productized and refined the Alibaba data system to serve industries such as the public sector, finance, retail, Internet, automotive, and manufacturing, and has earned the trust of tens of thousands of customers that rely on it for digital transformation and value creation.
Overview of DataWorks Copilot
DataWorks Copilot is your intelligent assistant in DataWorks. You can use the default large language model (LLM), DeepSeek-R1-671B (full-power edition), or DeepSeek-R1-Distill-Qwen-32B to complete related operations in DataWorks. With the advanced reasoning power of DeepSeek-R1, DataWorks Copilot enables you to perform complex tasks, such as SQL code generation, optimization, and testing, through natural language interactions. This greatly improves the efficiency of ETL development and data analysis.
Overview of the DataWorks notebook feature
The notebook feature of DataWorks provides an intelligent, interactive data development and analysis tool that can be used to perform engine-specific SQL or Python code analysis and run or debug code in real time. This way, you can obtain visualized data processing results. In addition, you can combine notebooks and other types of nodes to form a workflow, and commit the workflow to the scheduling system to run. This helps implement complex business scenarios in a flexible manner.
Usage notes
The regions and DataWorks editions in which the DataWorks Copilot public preview is available are limited. For more information, see the Public preview section in the "DataWorks Copilot" topic.
To use Python and the notebook feature in Data Studio, you must switch to a personal development environment first.
Limits
OpenLake supports only DLF 2.0.
Data catalogs support only DLF 2.0.
Environment preparation
Note: Make sure that Participate in Public Preview of Data Studio is turned on for the workspace.
Experiment procedure
Step 1: Manage data catalogs
The data catalog management capability of the data lakehouse solution allows you to create and manage data catalogs of compute engine types such as Data Lake Formation (DLF), MaxCompute, and Hologres.
In the left-side navigation pane of the Data Studio page, click the DATA CATALOG icon. The DATA CATALOG pane appears. In the DATA CATALOG pane, find the metadata type that you want to manage, move the pointer over the name of the desired data catalog, click the icon to the right of the data catalog, and then click Open. The configuration tab of the data catalog appears.
On the configuration tab of the data catalog, click the name of a schema. On the tab that appears, click the name of a table. The details page of the table appears.
In the left-side navigation pane of the Data Studio page, click the DATA CATALOG icon. The DATA CATALOG pane appears. In the DATA CATALOG pane, find the metadata type that you want to manage, move the pointer over the name of the desired data catalog, click the icon to the right of the data catalog, and then click Create Table. The Create Table tab appears.
In the left area of the Create Table tab, specify the table name and field names. Alternatively, enter a DDL statement in the right area of the Create Table tab to create a table. Then, click Deploy in the top toolbar.
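For the DDL approach, a statement of the following form can be used. The table and field names below are illustrative placeholders rather than preset experiment objects, and the exact syntax that is supported depends on the data catalog type.
CREATE TABLE IF NOT EXISTS demo_user_info (
    user_id   STRING COMMENT 'User ID',        -- placeholder field
    user_name STRING COMMENT 'User name',      -- placeholder field
    user_age  INT    COMMENT 'User age'        -- placeholder field
);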
Step 2: Orchestrate a workflow
In DataWorks, you can visually drag and drop various types of data development nodes to orchestrate them into a workflow from a business perspective, without separately configuring common parameters such as the scheduling time for each node. This helps you easily manage complex projects.
In the left-side navigation pane of the Data Studio page, click the DATASTUDIO icon. The DATASTUDIO pane appears. In the DATASTUDIO pane, click Workspace Directories, click the icon on the right, and then click Create Workflow.
In the popover that appears, enter a workflow name and press the Enter key.
Preset workflow name:
Retail e-commerce business overview
On the configuration tab of the workflow, click Drag or Click to Add Node in the center of the canvas. In the Create Node dialog box, configure the Node Type and Node Name parameters and click Confirm.
Preset node name:
Retail e-commerce overview
Preset node type:
Zero load node
In the Create Node section of the configuration tab of the workflow, find the node type that you want to use, drag the node type to the canvas, and then release. In the Create Node dialog box, specify the node name and click Confirm.
The following table describes the preset node types and names.
Node type | Node name
Data Integration - Batch synchronization | ods_mbr_user_info
MaxCompute - MaxCompute SQL | dim_ec_mbr_user_info
MaxCompute - MaxCompute SQL | dws_ec_mbr_cnt_nd
Notebook | ads_ec_kpi_report
On the canvas, find the two nodes for which you want to configure a dependency, and move the pointer over the middle of the lower edge of one node. When the + icon appears, drag a connection line from that node to the other node and release.
Create the required nodes in sequence, configure the dependencies for the nodes, and then click Save.
After you save the configurations, you can change the layout mode of the nodes on the canvas in the top toolbar based on your business requirements.
In the right-side navigation pane of the configuration tab of the workflow, click Properties. On the Properties tab, configure the scheduling parameters and node dependencies for the workflow. In the Scheduling Parameters section of the tab, click Add Parameter. In the input box that appears, enter bizdate in the Parameter Name field and select $[yyyymmdd-1] from the Parameter Value drop-down list.
In the Scheduling Dependencies section of the tab, click Add Dependency. In the section that appears, enter ads_ec_ec360_gmv_kpi_overview in the Ancestor Object field and press Enter. Wait for the results to return. Select the ancestor objects that you want from the result list and click Add.
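For reference, the bizdate scheduling parameter that you configured can be referenced in the code of nodes in the workflow. The following MaxCompute SQL snippet is an illustrative sketch only; the table name and partition column are placeholders, not preset experiment objects.
-- Illustrative only: filter a partitioned table by the bizdate scheduling parameter.
SELECT *
FROM   dwd_ec_trd_order_placeholder
WHERE  ds = '${bizdate}';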
In the top toolbar, click Deploy. The Deployment tab appears in the lower-right corner. Click Start Deployment to Production Environment to the right of the Deployment Description field, and perform the check and confirmation operations based on the on-screen instructions in sequence.
Step 3: Perform multi-engine collaborative development
Data Studio supports data warehouse development based on data synchronization nodes, nodes of dozens of compute engine types such as MaxCompute, Hologres, EMR, Flink, and AnalyticDB (ADB), and nodes developed by using notebooks and Python, and also supports complex scheduling configurations for these nodes. Data Studio provides an R&D mode with development-production environment isolation. In this experiment, a Flink SQL Streaming node is created.
In the left-side navigation pane of the Data Studio page, click the DATASTUDIO icon. The DATASTUDIO pane appears. In the DATASTUDIO pane, click Workspace Directories, click the icon on the right, and then choose Create Node > Flink > Flink SQL Streaming. In the popover that appears, enter a node name and press the Enter key.
Preset node name:
ads_ec_page_visit_log
On the configuration tab of the node, paste the preset code of the Flink SQL Streaming node into the code editor.
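The preset code is provided in the experiment environment. For orientation only, a Flink SQL streaming job typically declares a source table and a sink table and then inserts data from one into the other. The sketch below uses the built-in datagen and print connectors with placeholder table and field names; it is not the preset ads_ec_page_visit_log logic.
-- Minimal Flink SQL streaming sketch (illustrative only, not the preset code).
CREATE TEMPORARY TABLE page_visit_source (
    user_id  STRING,
    page_id  STRING,
    visit_ts TIMESTAMP(3)
) WITH (
    'connector' = 'datagen',
    'rows-per-second' = '1'
);

CREATE TEMPORARY TABLE page_visit_sink (
    user_id  STRING,
    page_id  STRING,
    visit_ts TIMESTAMP(3)
) WITH (
    'connector' = 'print'
);

INSERT INTO page_visit_sink
SELECT user_id, page_id, visit_ts
FROM page_visit_source;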
In the right-side navigation pane of the configuration tab of the node, click Real-time Configurations. On the tab that appears, configure the Flink resource-related parameters, script parameters, and Flink runtime parameters. The following figure shows the parameter values.
After you configure the parameters on the Real-time Configurations tab, click Save and then Deploy in the top toolbar. The Deployment tab appears in the lower-right corner. Click Start Deployment to Production Environment to the right of the Deployment Description field, and perform the check and confirmation operations based on the on-screen instructions in sequence.
Step 4: Enter the personal development environment
In a personal development environment, the following items and features are supported: custom container images, connection to File Storage NAS (NAS) and Git, and Python-based programming and notebooks.
In the top navigation bar of the Data Studio page, click Select Personal development environment and then select the personal development environment that you want to enter.
Step 5: Write and debug Python code
DataWorks is deeply integrated with Data Science Workshop (DSW) of Platform for AI (PAI). Data Studio supports the writing, debugging, scheduling, and running of Python code in a personal development environment.
You can perform operations in this step only after you complete Step 4: Enter the personal development environment.
In the selected personal development environment on the Data Studio page, click workspace under Personal Directory and click the icon. An unnamed file is added to the list. Enter the preset file name, press the Enter key, and wait for the file to be generated.
Preset file name:
ec_item_rec.py
In the code editor on the configuration tab of the Python file, enter the preset Python code. In the top toolbar, select Run Python File. On the TERMINAL tab in the lower part of the configuration tab, view the running result.
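The preset code is available in the experiment environment. If you only want to verify that the Python runtime works, a small stand-in script in the same spirit (a toy item-recommendation ranking over made-up data) could look like the following. All names and data here are hypothetical placeholders, not the preset ec_item_rec.py code.
# Illustrative stand-in script (not the preset ec_item_rec.py code):
# rank items for a member by a simple popularity score.
from collections import Counter

# Toy click log: (member_id, item_id) pairs. Placeholder data only.
click_log = [
    ("m001", "item_a"), ("m001", "item_b"),
    ("m002", "item_a"), ("m003", "item_a"), ("m003", "item_c"),
]

def recommend(member_id, top_n=2):
    """Recommend the most-clicked items that the member has not clicked yet."""
    clicked = {item for mid, item in click_log if mid == member_id}
    popularity = Counter(item for _, item in click_log)
    candidates = [(item, cnt) for item, cnt in popularity.most_common() if item not in clicked]
    return [item for item, _ in candidates[:top_n]]

if __name__ == "__main__":
    print(recommend("m002"))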
In the top toolbar of the configuration tab of the Python file, select Debug Python File. Move the pointer over a code line number in the code editor. A red dot appears. Click the red dot to add a breakpoint. In the upper-left part above the DATASTUDIO pane, click the icon to debug the code.
Step 6: Explore data based on notebooks
When you explore data based on notebooks, the related operations are performed in the personal development environment. Therefore, you must complete Step 4: Enter the personal development environment before you perform operations in this step.
Create a notebook
Go to the Data Studio page.
In the Personal Directory section, right-click the desired folder and select Create Notebook.
An unnamed file is added to the list. Enter the notebook name, and press the Enter key or click a blank area to apply the name.
Click the notebook name in the Personal Directory section. The configuration tab of the notebook appears.
Use the notebook
The operations in the Use the notebook section are independent of each other and do not need to be performed in a specific order. You can perform them based on your business requirements.
Multi-engine development in the notebook
EMR Spark SQL
On the configuration tab of the notebook, click the icon to add an SQL cell.
In the SQL cell, enter the following statement to query the dim_ec_mbr_user_info table:
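For example, a simple full-table preview query of the following form can be used. The exact preset statement may differ, and you may need to qualify the table with the catalog and schema names that are used in your environment.
SELECT * FROM dim_ec_mbr_user_info LIMIT 10;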
In the lower-right corner of the SQL cell, select EMR Spark SQL and select openlake_serverless_spark as the computing resource. Click Run. Wait until the running is complete and view the data result.
StarRocks SQL
On the configuration tab of the notebook, click the icon to add an SQL cell.
In the SQL cell, enter the following statement to query the dws_ec_trd_cate_commodity_gmv_kpi_fy table:
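For example, the following preview query can be used; qualify the table name with your catalog and schema if required.
SELECT * FROM dws_ec_trd_cate_commodity_gmv_kpi_fy LIMIT 10;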
In the lower-right corner of the SQL cell, select StarRocks SQL and select openlake_starrocks as the computing resource. Click Run. Wait until the running is complete and view the data result.
Hologres SQL
On the configuration tab of the notebook, click the icon to add an SQL cell.
In the SQL cell, enter the following statement to query the dws_ec_mbr_cnt_std table:
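For example, the following query can be used. The openlake_win.default qualifier follows the pattern of the Hologres query that appears later in this step; adjust it if your catalog or schema differs.
SELECT * FROM openlake_win.default.dws_ec_mbr_cnt_std;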
In the lower-right corner of the SQL cell, select Hologres SQL and select openlake_hologres as the computing resource. Click Run. Wait until the running is complete and view the data result.
MaxCompute SQL
On the configuration tab of the notebook, click the icon to add an SQL cell.
In the SQL cell, enter the following statement to query the dws_ec_mbr_cnt_std table:
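For example, the following preview query can be used; the unqualified table name is an assumption, and you may need to prefix it with the project or catalog name in your environment.
SELECT * FROM dws_ec_mbr_cnt_std LIMIT 10;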
In the lower-right corner of the SQL cell, select MaxCompute SQL and select openlake_maxcompute as the computing resource. Click Run. Wait until the running is complete and view the data result.
Interactive data analysis in the notebook
On the configuration tab of the notebook, click the icon to add a Python cell.
In the upper-right corner of the Python cell, click the icon to start DataWorks Copilot, an intelligent programming assistant.
In the input box that appears, enter the following requirement to use ipywidgets to generate a widget that can be used to query members by age.
Note: Requirement description: Use Python to generate a slider widget for member age. The value range is 1 to 100 and the default value is 20. Monitor value changes in real time and save the value to the query_age global variable.
Check the Python code generated by DataWorks Copilot and click Accept.
Click the Run icon in the Python cell and wait until the running is complete. The widget is generated by running the code generated by DataWorks Copilot or the preset code. Drag the slider to select the desired age.
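For reference, a minimal sketch that meets the requirement above might look like the following, assuming ipywidgets is available in the personal development environment. The code that DataWorks Copilot generates may differ.
# Minimal sketch: an age slider that keeps the query_age global variable in sync.
import ipywidgets as widgets
from IPython.display import display

query_age = 20  # Default value.

age_slider = widgets.IntSlider(value=20, min=1, max=100, description='Member age')

def on_age_change(change):
    # Save the latest slider value to the query_age global variable.
    global query_age
    query_age = change['new']

age_slider.observe(on_age_change, names='value')
display(age_slider)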
On the configuration tab of the notebook, click the icon to add an SQL cell.
In the SQL cell, enter the following statement, which contains the member age variable ${query_age} that is defined in Python:
SELECT * FROM openlake_win.default.dim_ec_mbr_user_info WHERE CAST(id_age AS INT) >= ${query_age};
In the lower-right corner of the SQL cell, select Hologres SQL and select openlake_hologres as the computing resource. Click Run. Wait until the running is complete and view the data result.
In the running result, click the icon to generate a chart.
Model development and training in the notebook
On the configuration tab of the notebook, click the icon to add an SQL cell.
In the SQL cell, enter the following statement to query the ods_trade_order table:
SELECT * FROM openlake_win.default.ods_trade_order;
Write the SQL query results to a DataFrame variable. Click the place where df is displayed to specify a custom name for the DataFrame variable, such as df_ml. Click the Run icon in the SQL cell, wait until the running is complete, and view the data result.
On the configuration tab of the notebook, click the icon to add a Python cell.
In the Python cell, enter the following code to cleanse and process the data by using pandas and store the result in a new DataFrame variable named df_ml_clean:
import pandas as pd

def clean_data(df_ml):
    # Generate a new column: estimated total order amount = product unit price × product quantity.
    df_ml['predict_total_fee'] = df_ml['item_price'].astype(float).values * df_ml['buy_amount'].astype(float).values
    # Rename the total_fee column to actual_total_fee.
    df_ml = df_ml.rename(columns={'total_fee': 'actual_total_fee'})
    return df_ml

df_ml_clean = clean_data(df_ml.copy())
df_ml_clean.head()
Click the Run icon in the Python cell, wait until the running is complete, and view the data cleansing result.
On the configuration tab of the notebook, click the icon to add another Python cell.
In the Python cell, enter the following code to build a linear regression machine learning model and perform training and testing.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Obtain the product price and total costs.
X = df_ml_clean[['predict_total_fee']].values
y = df_ml_clean['actual_total_fee'].astype(float).values

# Prepare data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)

# Create and train a model.
model = LinearRegression()
model.fit(X_train, y_train)

# Perform prediction and evaluation.
y_pred = model.predict(X_test)
for index, (x_t, y_pre, y_t) in enumerate(zip(X_test, y_pred, y_test)):
    print("[{:>2}] input: {:<10} prediction:{:<10} gt: {:<10}".format(str(index + 1), f"{x_t[0]:.3f}", f"{y_pre:.3f}", f"{y_t:.3f}"))

# Calculate the mean squared error (MSE).
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)
Click Run, wait until the running is complete, and view the test result of the model training.