Data Lake Formation (DLF) provides metadata extraction and data exploration features. This tutorial walks you through analyzing Taobao user behavior data end-to-end: upload a CSV dataset to OSS, run a metadata extraction task to auto-detect schemas, and query the resulting tables with SQL in the DLF Data Exploration editor.
Prerequisites
Before you begin, ensure that you have:
An Object Storage Service (OSS) bucket. See Create a bucket.
What you'll learn
By following this tutorial, you will:
Upload a sample CSV dataset to OSS
Use DLF Metadata Discovery to automatically detect schemas and create metadata tables
Run SQL queries in the DLF Data Exploration editor to analyze user behavior, funnel metrics, and item popularity
Dataset description
This tutorial uses a pruned version of the Taobao user behavior dataset from an Alibaba Cloud Tianchi competition. The dataset contains user behavior records and product data in CSV format, covering December 1–7, 2014.
For the full dataset, see Tianchi dataset #46.
The dataset includes two tables: a user table tracking individual user actions, and an item table describing products.
User table
Each row in the user table represents one user action. A sample row looks like this:
user_id, item_id, behavior_type, user_geohash, item_category, time
98047837, 232431562, 1, 9q9hr, 4245, 2014-12-06 02

| Column | Description | Remarks |
|---|---|---|
| user_id | The user ID | Sampled and desensitized. The ID is not a real ID. |
| item_id | The item ID | Desensitized. The ID is not a real ID. |
| behavior_type | The type of the user behavior | Valid values: 1, 2, 3, and 4. 1 indicates clicks, 2 indicates add-to-favorites, 3 indicates add-to-cart, and 4 indicates payment. |
| user_geohash | The geographical location of the user when the behavior occurs. The value may be empty. | Generated based on the latitude and longitude of the user and by using a confidential algorithm. |
| item_category | The category ID of the item | Desensitized. The ID is not a real ID. |
| time | The time when the behavior occurs | Accurate to the nearest hour. |
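To make the column semantics concrete, here is a minimal Python sketch (not part of the tutorial's SQL workflow) that parses the sample row above and decodes the behavior code using the mapping from the table:

```python
# Decode one user-table row; the code-to-label mapping follows the table above.
BEHAVIOR_LABELS = {1: "click", 2: "collect", 3: "cart", 4: "pay"}

def parse_user_row(line: str) -> dict:
    user_id, item_id, behavior_type, user_geohash, item_category, time = (
        field.strip() for field in line.split(",")
    )
    return {
        "user_id": user_id,
        "item_id": item_id,
        "behavior": BEHAVIOR_LABELS[int(behavior_type)],
        "user_geohash": user_geohash or None,  # may be empty
        "item_category": item_category,
        "time": time,  # accurate to the hour
    }

row = parse_user_row("98047837, 232431562, 1, 9q9hr, 4245, 2014-12-06 02")
print(row["behavior"])  # click
```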
Item table
Each row in the item table represents one product. A sample row looks like this:
item_id, item_geohash, item_category
100003064, 93rbz7, 5894

| Column | Description | Remarks |
|---|---|---|
| item_id | The item ID | Sampled and desensitized. The ID is not a real ID. |
| item_geohash | The geographical location of the item when the behavior occurs. The value may be empty. | Generated based on the latitude and longitude of the item and by using a confidential algorithm. |
| item_category | The category ID of the item | Desensitized. The ID is not a real ID. |
Procedure
Step 1: Activate DLF and OSS
Log on to the DLF console.
Activate DLF and OSS and grant the required permissions. Skip this step if you have already activated both services.
If DLF is not yet activated, you are prompted to activate it when you first log on. Click Activate Data Lake Formation for Free.
After activation, return to the DLF console homepage. Click Activate to activate OSS and grant DLF the permissions to access dependent data sources.
The DLF console homepage opens.
Step 2: Upload the data files to OSS
Download the sample data to your local disk and decompress the package. The `user_behavior_data` folder contains two subfolders, `item` and `user`, each holding CSV data files. This tutorial focuses on the `user` folder. The following figure shows some of the data in the folder.
Upload the `user_behavior_data` folder to OSS. See Simple upload. The following figure shows the resulting directory structure in OSS. The `item` and `user` subfolders correspond to the two tables.

Note: Delete the .DS_Store file from both folders before uploading.
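If you work on macOS, the following Python sketch removes stray .DS_Store files from a local folder tree before you upload it. The folder name is an assumption based on the download step above; adjust the path to match your local copy:

```python
# Remove macOS .DS_Store files from a local folder tree before uploading to OSS.
import os

def remove_ds_store(root: str) -> list:
    """Walk the tree under root, delete every .DS_Store, return deleted paths."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == ".DS_Store":
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed

# Example (assumes the decompressed folder sits in the current directory):
# remove_ds_store("user_behavior_data")
```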

Step 3: Extract metadata
Log on to the DLF console.
Create a database.
In the left-side navigation pane, choose Metadata > Metadata.
Click the Database tab, select the catalog from the Catalog List drop-down list, and then click Create Database.
Configure the following parameters and click OK.
| Parameter | Description |
|---|---|
| Catalog | Select a catalog. |
| Database Name | Enter a database name. |
| Database Description | Optional. Enter a description. |
| Select Path | Select the OSS path where you uploaded `user_behavior_data` in Step 2. |

Confirm that the database appears in the list.

Run a metadata extraction task.
In the left-side navigation pane, choose Metadata > Metadata Discovery.
On the Metadata Discovery page, click Create Extraction Task. For a full parameter reference, see Metadata discovery.
Configure the Set Extraction Source section and click Next.

Set the Destination Database parameter and click Next.

Configure the Set Extraction Task section with the following settings.
| Parameter | Value | Notes |
|---|---|---|
| RAM Role | AliyunDLFWorkFlowDefaultRole | Pre-granted during activation. |
| Execution Policy | Manual | Runs the task immediately on demand. |
| Extraction Policy | Extract All | Scans every file for metadata. More accurate than sampling, but takes longer for large datasets. |
Review the configuration and click Save and Execute.

DLF returns to the Metadata Discovery page and starts the task automatically. Hover over the question mark icon in the status column. A message confirms that two metadata tables were created: one for `item` and one for `user`.
Verify the extracted tables.
Click the Database link, then click the Table List tab to see the tables created in the database.

Click a table name to confirm the extracted schema matches the expected columns.

Step 4: Analyze user behavior data
Data analysis in this tutorial consists of three stages:
Preview and check data.
Preprocess data.
Analyze user engagement, funnels, and item popularity.
In the left-side navigation pane, click Data Exploration to open the SQL editor.
Preview and check data
Run the following statements to inspect the raw data and count records. SET spark.sql.legacy.timeParserPolicy=LEGACY ensures the time column parses correctly with Spark SQL's legacy date format parser.
SET spark.sql.legacy.timeParserPolicy=LEGACY;
-- Preview raw rows
SELECT * FROM `demo_db`.`user` LIMIT 10;
SELECT * FROM `demo_db`.`item` LIMIT 10;
-- Count distinct users
SELECT COUNT(DISTINCT user_id) FROM `demo_db`.`user`;
-- Count distinct items
SELECT COUNT(DISTINCT item_id) FROM `demo_db`.`item`;
-- Count total behavior records
SELECT COUNT(*) FROM `demo_db`.`user`;

The following figure shows the result.

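The same preview-and-count pattern can be tried locally before running it in DLF. The sketch below uses Python's built-in SQLite engine on a tiny hypothetical sample in place of the DLF Spark SQL editor (syntax for these basic aggregates is the same in both):

```python
# Illustrate COUNT(DISTINCT ...) and COUNT(*) on a tiny in-memory sample.
# SQLite stands in for the DLF Spark SQL engine; the sample rows are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (user_id INT, item_id INT, behavior_type INT, time TEXT)"
)
rows = [
    (98047837, 232431562, 1, "2014-12-06 02"),
    (98047837, 383583590, 1, "2014-12-06 02"),
    (97726136, 64749712, 4, "2014-12-02 15"),
]
conn.executemany("INSERT INTO user VALUES (?, ?, ?, ?)", rows)

distinct_users = conn.execute("SELECT COUNT(DISTINCT user_id) FROM user").fetchone()[0]
total_records = conn.execute("SELECT COUNT(*) FROM user").fetchone()[0]
print(distinct_users, total_records)  # 2 3
```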
Preprocess data
The raw user table stores behavior types as numeric codes (1–4) and combines date and time in a single column. The following statement transforms this raw data into a more analysis-friendly format:

`USING PARQUET` stores the output in columnar Parquet format, which compresses data and speeds up analytical queries compared to CSV.

`PARTITIONED BY (date)` lets Spark skip entire date partitions when a query filters on a specific day.

The `CASE WHEN` block converts numeric behavior codes to readable labels (`click`, `collect`, `cart`, `pay`).

`date_format` splits the `time` column into separate `date`, `hour`, and `day_of_week` columns, enabling date- and hour-level analysis in later queries.
CREATE TABLE `demo_db`.`user_log`
USING PARQUET
PARTITIONED BY (date)
AS SELECT
user_id,
item_id,
-- Convert numeric behavior codes to readable labels
CASE
WHEN behavior_type = 1 THEN 'click'
WHEN behavior_type = 2 THEN 'collect'
WHEN behavior_type = 3 THEN 'cart'
WHEN behavior_type = 4 THEN 'pay'
END AS behavior,
item_category,
time,
date_format(time, 'yyyy-MM-dd') AS date, -- Used as partition key
date_format(time, 'H') AS hour,
date_format(time, 'u') AS day_of_week -- 1=Monday ... 7=Sunday
FROM `demo_db`.`user`;
-- Verify the result
SELECT * FROM `demo_db`.`user_log` LIMIT 10;

The following figure shows the result.

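For intuition, here is a plain-Python sketch of the per-row transformation that the `CREATE TABLE ... AS SELECT` statement performs, using `datetime` in place of Spark's `date_format` (the behavior mapping and column names mirror the SQL above):

```python
# Per-row equivalent of the Spark SQL transformation, for illustration only.
from datetime import datetime

BEHAVIOR_LABELS = {1: "click", 2: "collect", 3: "cart", 4: "pay"}

def preprocess(user_id, item_id, behavior_type, item_category, time_str):
    t = datetime.strptime(time_str, "%Y-%m-%d %H")  # source time is hour-accurate
    return {
        "user_id": user_id,
        "item_id": item_id,
        "behavior": BEHAVior_LABELS[behavior_type] if False else BEHAVIOR_LABELS[behavior_type],
        "item_category": item_category,
        "time": time_str,
        "date": t.strftime("%Y-%m-%d"),     # partition key
        "hour": str(t.hour),                # like date_format(time, 'H')
        "day_of_week": str(t.isoweekday()), # 1=Monday ... 7=Sunday, like 'u'
    }

row = preprocess(98047837, 232431562, 1, 4245, "2014-12-06 02")
print(row["behavior"], row["date"], row["day_of_week"])
```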
Analyze user behavior
All three analyses below query the user_log table created in the previous step.
Funnel analysis: Count how many times users performed each behavior type across the entire dataset. This reveals where users drop off in the click > collect > cart > pay funnel.
-- Takes approximately 13 seconds
SELECT behavior, COUNT(*) AS total
FROM `demo_db`.`user_log`
GROUP BY behavior
ORDER BY total DESC;

The following figure shows the result.

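The funnel aggregation amounts to counting label occurrences and sorting descending. A minimal Python equivalent over a made-up sample, using `collections.Counter`:

```python
# Mirror of "GROUP BY behavior ORDER BY total DESC" on a tiny hypothetical sample.
from collections import Counter

behaviors = ["click", "click", "click", "cart", "collect", "click", "pay"]
funnel = Counter(behaviors)
for behavior, total in funnel.most_common():  # sorted by count, descending
    print(behavior, total)
```

On real data the counts drop sharply from click to pay, which is exactly the drop-off the funnel analysis visualizes.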
Daily engagement: Break down unique visitors (UV) and behavior counts by date and day of week. This shows whether activity peaks on weekdays or weekends.
NoteBecause the dataset is pruned for this tutorial, the day-of-week pattern may not reflect real Taobao traffic. The query structure is correct — apply it to the full dataset for meaningful trends.
-- Takes approximately 14 seconds
SELECT date, day_of_week,
       COUNT(DISTINCT(user_id)) AS uv,
       SUM(CASE WHEN behavior = 'click' THEN 1 ELSE 0 END) AS click,
       SUM(CASE WHEN behavior = 'cart' THEN 1 ELSE 0 END) AS cart,
       SUM(CASE WHEN behavior = 'collect' THEN 1 ELSE 0 END) AS collect,
       SUM(CASE WHEN behavior = 'pay' THEN 1 ELSE 0 END) AS pay
FROM `demo_db`.`user_log`
GROUP BY date, day_of_week
ORDER BY date;

Top item categories: Join the `item` table with `user_log` to find the 10 most purchased item categories. Filtering on `behavior = 'pay'` isolates actual purchases rather than browsing activity.

-- Takes approximately 1 minute and 10 seconds
SELECT item.item_category, COUNT(*) AS times
FROM `demo_db`.`item` item
JOIN `demo_db`.`user_log` log ON item.item_id = log.item_id
WHERE log.behavior = 'pay'
GROUP BY item.item_category
ORDER BY times DESC
LIMIT 10;

(Optional) Download results: Save query results as a CSV file.
In the upper-right corner of the Data Exploration page, click Path Settings. In the Select OSS Path dialog box, set the Result Storage Path to an existing folder or create a new one.

After running a query, go to the Running History tab and click Download in the Actions column to download the result file. The file is also accessible from the OSS console.
(Optional) Save SQL statements: Click Save to store the SQL statements. Saved queries are available on the Saved Query tab for future runs.
What's next
Now that you have completed the end-to-end tutorial, explore these topics to go further:
Schedule extraction tasks: Automate metadata extraction by setting a scheduled execution policy instead of Manual.
Query larger datasets: Apply the same SQL patterns to your own data by pointing the extraction task at your OSS buckets.
Optimize query performance: Partition your tables by the columns you filter on most frequently to reduce scan time.
FAQ
If you have questions or want to further explore data lake technologies, follow the official account Data Lake Technology Circle in WeChat.