
Data Lake Formation: DLF data exploration quick start: Taobao user behavior analysis

Last Updated: Mar 25, 2026

Data Lake Formation (DLF) provides metadata extraction and data exploration features. This tutorial walks you through analyzing Taobao user behavior data end-to-end: upload a CSV dataset to OSS, run a metadata extraction task to auto-detect schemas, and query the resulting tables with SQL in the DLF Data Exploration editor.

Prerequisites

Before you begin, make sure you have an Alibaba Cloud account with the permissions required to activate DLF and OSS. Step 1 walks you through activating both services.

What you'll learn

By following this tutorial, you will:

  • Upload a sample CSV dataset to OSS

  • Use DLF Metadata Discovery to automatically detect schemas and create metadata tables

  • Run SQL queries in the DLF Data Exploration editor to analyze user behavior, funnel metrics, and item popularity

Dataset description

This tutorial uses a pruned version of the Taobao user behavior dataset from an Alibaba Cloud Tianchi competition. The dataset contains user behavior records and product data in CSV format, covering December 1–7, 2014.

For the full dataset, see Tianchi dataset #46.

The dataset includes two tables: a user table tracking individual user actions, and an item table describing products.

User table

Each row in the user table represents one user action. A sample row looks like this:

user_id,  item_id,  behavior_type,  user_geohash,  item_category,  time
98047837, 232431562, 1,             9q9hr,          4245,           2014-12-06 02
| Column | Description | Remarks |
| --- | --- | --- |
| user_id | The user ID | Sampled and desensitized. The ID is not a real ID. |
| item_id | The item ID | Desensitized. The ID is not a real ID. |
| behavior_type | The type of the user behavior | Valid values: 1 (click), 2 (add-to-favorites), 3 (add-to-cart), and 4 (payment). |
| user_geohash | The geographical location of the user when the behavior occurs. The value may be empty. | Generated from the user's latitude and longitude by a confidential algorithm. |
| item_category | The category ID of the item | Desensitized. The ID is not a real ID. |
| time | The time when the behavior occurs | Accurate to the hour. |

Item table

Each row in the item table represents one product. A sample row looks like this:

item_id,   item_geohash,  item_category
100003064, 93rbz7,        5894
| Column | Description | Remarks |
| --- | --- | --- |
| item_id | The item ID | Sampled and desensitized. The ID is not a real ID. |
| item_geohash | The geographical location of the item. The value may be empty. | Generated from the item's latitude and longitude by a confidential algorithm. |
| item_category | The category ID of the item | Desensitized. The ID is not a real ID. |

Procedure

Step 1: Activate DLF and OSS

  1. Log on to the DLF console.

  2. Activate DLF and OSS and grant the required permissions. Skip this step if you have already activated both services.

    1. If DLF is not yet activated, you are prompted to activate it when you first log on. Click Activate Data Lake Formation for Free.

    2. After activation, return to the DLF console homepage. Click Activate to activate OSS and grant DLF the permissions to access dependent data sources.

  3. The DLF console homepage opens.

Step 2: Upload the data files to OSS

  1. Download the sample data to your local disk. Decompress the package. The user_behavior_data folder contains two subfolders: item and user, each holding CSV data files. This tutorial focuses on the user folder. The following figure shows some of the data in the folder.

    image

  2. Upload the user_behavior_data folder to OSS. See Simple upload. The following figure shows the resulting directory structure in OSS. The item and user subfolders correspond to the two tables.

    Note

If you decompressed the package on macOS, delete the hidden .DS_Store file from both folders before uploading, so that the extraction task does not treat it as a data file.

    image

Step 3: Extract metadata

  1. Log on to the DLF console.

  2. Create a database.

    1. In the left-side navigation pane, choose Metadata > Metadata.

    2. Click the Database tab, select the catalog from the Catalog List drop-down list, and then click Create Database.

    3. Configure the following parameters and click OK.

      | Parameter | Description |
      | --- | --- |
      | Catalog | Select a catalog. |
      | Database Name | Enter a database name. |
      | Database Description | Optional. Enter a description. |
      | Select Path | Select the OSS path where you uploaded user_behavior_data in Step 2. |
    4. Confirm that the database appears in the list. image

  3. Run a metadata extraction task.

    1. In the left-side navigation pane, choose Metadata > Metadata Discovery.

    2. On the Metadata Discovery page, click Create Extraction Task. For a full parameter reference, see Metadata discovery.

    3. Configure the Set Extraction Source section and click Next. image

    4. Set the Destination Database parameter and click Next. image

    5. Configure the Set Extraction Task section with the following settings.

      | Parameter | Value | Notes |
      | --- | --- | --- |
      | RAM Role | AliyunDLFWorkFlowDefaultRole | Pre-granted during activation. |
      | Execution Policy | Manual | Runs the task immediately on demand. |
      | Extraction Policy | Extract All | Scans every file for metadata. More accurate than sampling, but takes longer for large datasets. |

      image

    6. Review the configuration and click Save and Execute. image

    7. DLF returns to the Metadata Discovery page and starts the task automatically. Hover over the question mark icon in the status column. A message confirms that two metadata tables were created — one for item and one for user. image

  4. Verify the extracted tables.

    1. Click the Database link, then click the Table List tab to see the tables created in the database. image

    2. Click a table name to confirm the extracted schema matches the expected columns. image image

Step 4: Analyze user behavior data

Data analysis in this tutorial consists of three stages:

  1. Preview and check data.

  2. Preprocess data.

  3. Analyze user engagement, funnels, and item popularity.

In the left-side navigation pane, click Data Exploration to open the SQL editor.

Preview and check data

Run the following statements to inspect the raw data and count records. SET spark.sql.legacy.timeParserPolicy=LEGACY ensures the time column parses correctly with Spark SQL's legacy date format parser.

SET spark.sql.legacy.timeParserPolicy=LEGACY;

-- Preview raw rows
SELECT * FROM `demo_db`.`user` LIMIT 10;
SELECT * FROM `demo_db`.`item` LIMIT 10;

-- Count distinct users
SELECT COUNT(DISTINCT user_id) FROM `demo_db`.`user`;

-- Count distinct items
SELECT COUNT(DISTINCT item_id) FROM `demo_db`.`item`;

-- Count total behavior records
SELECT COUNT(*) FROM `demo_db`.`user`;

The following figure shows the result.

image
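Because user_geohash may be empty, it can also help to check how many records actually carry a location before using it in analysis. A quick check against the same table (column names as created by the extraction task):

-- Count records with and without a geohash
SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN user_geohash IS NOT NULL AND user_geohash != '' THEN 1 ELSE 0 END) AS with_geohash
FROM `demo_db`.`user`;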

Preprocess data

The raw user table stores behavior types as numeric codes (1–4) and combines date and time in a single column. The following statement transforms this raw data into a more analysis-friendly format:

  • USING PARQUET stores the output in columnar Parquet format, which compresses data and speeds up analytical queries compared to CSV.

  • PARTITIONED BY (date) lets Spark skip entire date partitions when a query filters on a specific day.

  • The CASE WHEN block converts numeric behavior codes to readable labels (click, collect, cart, pay).

  • date_format splits the time column into separate date, hour, and day_of_week columns, enabling date- and hour-level analysis in later queries.

CREATE TABLE `demo_db`.`user_log`
USING PARQUET
PARTITIONED BY (date)
AS SELECT
    user_id,
    item_id,
    -- Convert numeric behavior codes to readable labels
    CASE
      WHEN behavior_type = 1 THEN 'click'
      WHEN behavior_type = 2 THEN 'collect'
      WHEN behavior_type = 3 THEN 'cart'
      WHEN behavior_type = 4 THEN 'pay'
    END AS behavior,
    item_category,
    time,
    date_format(time, 'yyyy-MM-dd') AS date,      -- Used as partition key
    date_format(time, 'H')          AS hour,
    date_format(time, 'u')          AS day_of_week -- 1=Monday ... 7=Sunday
  FROM `demo_db`.`user`;

-- Verify the result
SELECT * FROM `demo_db`.`user_log` LIMIT 10;

The following figure shows the result.

image
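Because user_log is partitioned by date, you can confirm that one partition was created per day in the dataset:

-- List the date partitions of the new table
SHOW PARTITIONS `demo_db`.`user_log`;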

Analyze user behavior

All three analyses below query the user_log table created in the previous step.

  1. Funnel analysis: Count how many times users performed each behavior type across the entire dataset. This reveals where users drop off in the click > collect > cart > pay funnel.

    -- Takes approximately 13 seconds
    SELECT
      behavior, COUNT(*) AS total
    FROM `demo_db`.`user_log`
    GROUP BY behavior
    ORDER BY total DESC;

    The following figure shows the result.

    image
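    To turn these counts into conversion rates, you can divide each behavior's total by the click total (assumed here to be the largest value, as in a typical funnel):

    -- Share of each behavior relative to clicks
    SELECT
      behavior,
      total,
      ROUND(100 * total / MAX(total) OVER (), 2) AS pct_of_clicks
    FROM (
      SELECT behavior, COUNT(*) AS total
      FROM `demo_db`.`user_log`
      GROUP BY behavior
    ) t;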

  2. Daily engagement: Break down unique visitors (UV) and behavior counts by date and day of week. This shows whether activity peaks on weekdays or weekends.

    Note

    Because the dataset is pruned for this tutorial, the day-of-week pattern may not reflect real Taobao traffic. The query structure is correct — apply it to the full dataset for meaningful trends.

    -- Takes approximately 14 seconds
    SELECT
      date, day_of_week,
      COUNT(DISTINCT user_id) AS uv,
      SUM(CASE WHEN behavior = 'click'   THEN 1 ELSE 0 END) AS click,
      SUM(CASE WHEN behavior = 'cart'    THEN 1 ELSE 0 END) AS cart,
      SUM(CASE WHEN behavior = 'collect' THEN 1 ELSE 0 END) AS collect,
      SUM(CASE WHEN behavior = 'pay'     THEN 1 ELSE 0 END) AS pay
    FROM `demo_db`.`user_log`
    GROUP BY date, day_of_week
    ORDER BY date;
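    The hour column created during preprocessing also supports an intraday view, for example to find the busiest hours of the day:

    -- Actions per hour across all days
    SELECT hour, COUNT(*) AS actions
    FROM `demo_db`.`user_log`
    GROUP BY hour
    ORDER BY CAST(hour AS INT);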
  3. Top item categories: Join the item table with user_log to find the 10 most purchased item categories. Filtering on behavior = 'pay' isolates actual purchases rather than browsing activity.

    -- Takes approximately 1 minute and 10 seconds
    SELECT item.item_category, COUNT(*) AS times
    FROM `demo_db`.`item` item
      JOIN `demo_db`.`user_log` log
      ON item.item_id = log.item_id
    WHERE log.behavior = 'pay'
    GROUP BY item.item_category
    ORDER BY times DESC
    LIMIT 10;
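    A related view is how many distinct buyers each category attracts, which separates broad appeal from repeat purchases by a few users:

    -- Distinct paying users per category
    SELECT item.item_category, COUNT(DISTINCT log.user_id) AS buyers
    FROM `demo_db`.`item` item
      JOIN `demo_db`.`user_log` log
      ON item.item_id = log.item_id
    WHERE log.behavior = 'pay'
    GROUP BY item.item_category
    ORDER BY buyers DESC
    LIMIT 10;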
  4. (Optional) Download results: Save query results as a CSV file.

    1. In the upper-right corner of the Data Exploration page, click Path Settings. In the Select OSS Path dialog box, set the Result Storage Path to an existing folder or create a new one. image

    2. After running a query, go to the Running History tab and click Download in the Actions column to download the result file. The file is also accessible from the OSS console.

  5. (Optional) Save SQL statements: Click Save to store the SQL statements. Saved queries are available on the Saved Query tab for future runs.

What's next

Now that you have completed the end-to-end tutorial, explore these topics to go further:

  • Schedule extraction tasks: Automate metadata extraction by setting a scheduled execution policy instead of Manual.

  • Query larger datasets: Apply the same SQL patterns to your own data by pointing the extraction task at your OSS buckets.

  • Optimize query performance: Partition your tables by the columns you filter on most frequently to reduce scan time.
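For example, the user_log table created in this tutorial is partitioned by date, so a day-level filter scans only one partition instead of the whole table:

-- Only the 2014-12-06 partition is read
SELECT COUNT(*) FROM `demo_db`.`user_log`
WHERE date = '2014-12-06';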

FAQ

If you have questions or want to explore data lake technologies further, follow the official Data Lake Technology Circle account on WeChat.