
Platform For AI: Feature engineering

Last Updated: Nov 30, 2023

You can use the feature engineering capability of the recommendation solution to process the original datasets and generate new feature tables for use in subsequent vector recall or ranking operations. The original datasets include tables such as user tables, item tables, and behavior tables.

Prerequisites

Datasets

The user table, item table, and behavior table used in the following demo are simulated datasets generated by a script, not real data.

User table: pai_online_project.rec_sln_demo_user_table

| Field | Type | Description |
| --- | --- | --- |
| user_id | bigint | The unique ID of the user. |
| gender | string | The gender of the user. |
| age | bigint | The age of the user. |
| city | string | The city in which the user resides. |
| item_cnt | bigint | The number of content items that the user has created. |
| follow_cnt | bigint | The number of users that the user follows. |
| follower_cnt | bigint | The number of followers of the user. |
| register_time | bigint | The registration time of the account. |
| tags | string | The tags of the user. |
| ds | string | The partition column of the table. |

Item table: pai_online_project.rec_sln_demo_item_table

| Field | Type | Description |
| --- | --- | --- |
| item_id | bigint | The ID of the item. |
| duration | double | The duration of the video. |
| title | string | The title of the item. |
| category | string | The level-1 category tag of the item. |
| author | bigint | The ID of the author of the item. |
| click_count | bigint | The total number of clicks on the item. |
| praise_count | bigint | The total number of likes on the item. |
| pub_time | bigint | The time when the item was published. |
| ds | string | The partition column of the table. |

Behavior table: pai_online_project.rec_sln_demo_behavior_table

| Field | Type | Description |
| --- | --- | --- |
| request_id | bigint | The ID of the tracking point or request. |
| user_id | bigint | The unique ID of the user. |
| exp_id | string | The experiment ID. |
| page | string | The page on which the behavior occurs. |
| net_type | string | The network type. |
| event_time | bigint | The time when the behavior occurred. |
| item_id | bigint | The ID of the item. |
| event | string | The type of the behavior. |
| playtime | double | The playback or reading duration. |
| ds | string | The partition column of the table. |
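To preview the demo data, you can query the tables with a partition filter. The following is a minimal example, assuming that you can access the pai_online_project demo tables from your project; the partition value is illustrative only:

```sql
-- Peek at a few rows of the public demo behavior table.
SELECT *
FROM pai_online_project.rec_sln_demo_behavior_table
WHERE ds = '20231130'  -- hypothetical partition value
LIMIT 10;
```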

Procedure

Step 1: Go to Machine Learning Designer

  1. Log on to the PAI console.

  2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

  3. In the left-side navigation pane, choose Model Development and Training > Visualized Modeling (Designer).

Step 2: Build a pipeline

  1. On the Visualized Modeling (Designer) page, click the Preset Templates tab.

  2. In the Recommended Solution - Feature Engineering section of the template list page, click Create.

  3. In the Create Pipeline dialog box, configure the parameters. You can use the default values.

    The value specified for the Pipeline Data Path parameter is an Object Storage Service (OSS) bucket path that is used to store temporary data and models generated during the runtime of the pipeline.

  4. Click OK.

    It takes approximately 10 seconds to create the pipeline.

  5. In the pipeline list, double-click Recommended Solution - Feature Engineering to enter the pipeline.

  6. View the components of the pipeline on the canvas. The system automatically creates the pipeline based on the preset template. The components are described in the following list.

    Component 1: Preprocesses the item table.

    • Replaces the tag feature delimiter with chr(29) for subsequent feature generation (FG).

    • Generates a feature that indicates whether the item is new.

    Component 2: Preprocesses the behavior table. Outputs derived features of the behavior time, such as day_h and week_day.

    Component 3: Preprocesses the user table.

    • Generates a feature that indicates whether the user is new.

    • Replaces the tag feature delimiter with chr(29) for subsequent feature generation (FG).

    Component 4: Joins the behavior table, user table, and item table to generate a wide behavior log table that has statistical properties.

    Component 5: Generates the item feature table, which contains the statistical features of items over a period of time. An example query that instantiates the following naming patterns is shown after this list.

    • item__{event}_cnt_{N}d: the number of times the specified behavior occurred on the item within N days, which indicates the popularity of the item.

    • item__{event}_{itemid}_dcnt_{N}d: the number of unique users who performed the specified behavior on the item within N days, which also indicates the popularity of the item.

    • item__{min|max|avg|sum}_{field}_{N}d: the statistical distribution of a numeric property of the users who performed positive behaviors on the item within N days, which characterizes the users who prefer the item.

    • item__kv_{cate}_{event}_{N}d: the statistics of a categorical property of the users who performed the specified behavior on the item within N days, which characterizes the users who prefer the item.

    Component 6: Generates the user feature table, which contains the statistical features of users over a period of time.
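    For example, assuming a click event and a 7-day window (hypothetical instantiations of the naming patterns above), the generated item feature table could be queried as follows. The column names are illustrative and may differ in the actual table:

    ```sql
    -- Sketch only: "click" and "7d" instantiate the {event} and {N}d
    -- placeholders above; actual column names may differ.
    SELECT
      item_id,
      item__click_cnt_7d,  -- clicks on the item within the last 7 days
      item__avg_age_7d     -- average age of users with positive behaviors on the item
    FROM rec_sln_demo_item_table_preprocess_all_feature_v2
    WHERE ds = '${pai.system.cycledate}'
    LIMIT 10;
    ```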

Step 3: Add a function

  1. Create a workflow. For more information, see Create a workflow.

  2. Right-click MaxCompute under the workflow that you created and choose Create Resource > Python to create a Python script named count_cates_kvs.py. For more information, see Create and use MaxCompute resources.

  3. Right-click MaxCompute under the workflow that you created and select Create Function to create a MaxCompute function named COUNT_CATES_KVS. Set Class Name to count_cates_kvs.CountCatesKVS and Resources to count_cates_kvs.py. For more information, see Create and use a MaxCompute function.
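    The actual content of count_cates_kvs.py is provided by the solution. The following is only a minimal sketch, assuming the resource implements a MaxCompute Python UDAF that aggregates categorical values into a key-value count string; the signature and logic are illustrative assumptions, not the actual implementation.

    ```python
    # count_cates_kvs.py -- illustrative sketch only; the real resource is
    # provided by the solution and may differ in signature and logic.
    from odps.udf import annotate, BaseUDAF

    @annotate('string->string')  # assumed signature
    class CountCatesKVS(BaseUDAF):
        def new_buffer(self):
            # The buffer maps each categorical value to its running count.
            return {}

        def iterate(self, buffer, cate):
            if cate is not None:
                buffer[cate] = buffer.get(cate, 0) + 1

        def merge(self, buffer, pbuffer):
            # Combine partial buffers produced by different workers.
            for k, v in pbuffer.items():
                buffer[k] = buffer.get(k, 0) + v

        def terminate(self, buffer):
            # Emit counts as "key:count" pairs, for example "sports:3,music:1".
            return ','.join('%s:%d' % (k, v) for k, v in sorted(buffer.items()))
    ```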

Step 4: Run the workflow and view the results

Note

The dataset in this example contains 45 days of data, so the workflow takes a long time to run. If you want the workflow to finish in a shorter period of time, perform the following operations:

  • Change the execution time window parameters to use data of a shorter period of time.

    • Click the following components and change the Execution time window parameter on the Parameters Setting tab from (-45,0] to (-9,0]:

      • 1_rec_sln_demo_item_table_preprocess_v2

      • 2_rec_sln_demo_behavior_table_preprocess_v2

      • 3_rec_sln_demo_user_table_preprocess_v2

      • 4_rec_sln_demo_behavior_table_preprocess_wide_v2

    • Click the following components and change the Execution time window parameter on the Parameters Setting tab from (-31,0] to (-8,0]:

      • 5_rec_sln_demo_item_table_preprocess_all_feature_v2

      • 6_rec_sln_demo_user_table_preprocess_all_feature_v2

  • Modify the SQL script to use a smaller sample size of users.

    • Click the 2_rec_sln_demo_behavior_table_preprocess_v2 component. On the Parameters Setting tab, change line 32 of the SQL Script parameter from WHERE ds = '${pai.system.cycledate}' to WHERE ds = '${pai.system.cycledate}' AND user_id % 10 = 1.

    • Click the 3_rec_sln_demo_user_table_preprocess_v2 component. On the Parameters Setting tab, change line 38 of the SQL Script parameter in the same way. The snippet after this list illustrates the change.
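    The following snippet shows the change. The sampling predicate keeps only users whose ID has a specific remainder, roughly one tenth of all users; the partition value is supplied by the ${pai.system.cycledate} system variable at run time:

    ```sql
    -- Before (line 32 or line 38 of the component's SQL script):
    WHERE ds = '${pai.system.cycledate}'

    -- After: sample roughly 10% of users to shorten the run.
    WHERE ds = '${pai.system.cycledate}' AND user_id % 10 = 1
    ```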

  1. Click the Run button in the top toolbar of the pipeline canvas.

  2. After the pipeline finishes running, check whether the following MaxCompute tables contain data for the last 30 days:

    • Item feature table: rec_sln_demo_item_table_preprocess_all_feature_v2

    • Behavior log wide table: rec_sln_demo_behavior_table_preprocess_v2

    • User feature table: rec_sln_demo_user_table_preprocess_all_feature_v2

    You can query data in the preceding tables by using SQL statements. For more information, see DataWorks.

    Note

    By default, full table scans are disabled for partitioned tables in the project to which the tables belong, so you must specify the partition that you want to scan. If you want to scan a full table by using an SQL statement, add set odps.sql.allow.fullscan=true before the SQL statement and commit them together. A full table scan increases the amount of data that is read and therefore the cost. The following example illustrates both approaches.
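    The partition value in the example is illustrative only:

    ```sql
    -- Preferred: restrict the scan to a specific partition.
    SELECT *
    FROM rec_sln_demo_user_table_preprocess_all_feature_v2
    WHERE ds = '20231130'  -- hypothetical partition value
    LIMIT 10;

    -- Alternative: explicitly allow a full table scan for this query
    -- (reads more data and increases cost).
    set odps.sql.allow.fullscan=true;
    SELECT *
    FROM rec_sln_demo_user_table_preprocess_all_feature_v2
    LIMIT 10;
    ```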