You can use the feature engineering capability provided by the recommendation solution to process the original datasets and generate new feature tables that can be used in subsequent vector recall or ranking operations. The original datasets include tables such as user tables, item tables, and behavior tables.
Prerequisites
Machine Learning Designer of Platform for AI (PAI) is activated and the default workspace is created. For more information, see Activate PAI and create the default workspace.
MaxCompute resources are associated with the workspace. For more information, see Manage workspaces.
A MaxCompute data source is created and associated as the engine of the workspace. For more information, see Add a MaxCompute data source of the new version.
Datasets
The user table, item table, and behavior table are generated by a script and are not real datasets. The tables are used only as examples in the following demo.
User table: pai_online_project.rec_sln_demo_user_table
| Field | Type | Description |
| --- | --- | --- |
| user_id | bigint | The unique ID of the user. |
| gender | string | The gender of the user. |
| age | bigint | The age of the user. |
| city | string | The city in which the user resides. |
| item_cnt | bigint | The number of content items that the user creates. |
| follow_cnt | bigint | The number of users that the user follows. |
| follower_cnt | bigint | The number of followers of the user. |
| register_time | bigint | The registration time of the account. |
| tags | string | The tags of the user. |
| ds | string | The partition column of the table. |
Item table: pai_online_project.rec_sln_demo_item_table
| Field | Type | Description |
| --- | --- | --- |
| item_id | bigint | The ID of the item. |
| duration | double | The duration of the video. |
| title | string | The title of the item. |
| category | string | The level-1 category of the item. |
| author | bigint | The author of the item. |
| click_count | bigint | The total number of clicks on the item. |
| praise_count | bigint | The total number of likes on the item. |
| pub_time | bigint | The date when the item was released. |
| ds | string | The partition column of the table. |
Behavior table: pai_online_project.rec_sln_demo_behavior_table
| Field | Type | Description |
| --- | --- | --- |
| request_id | bigint | The ID of the tracking point or request. |
| user_id | bigint | The unique ID of the user. |
| exp_id | string | The experiment ID. |
| page | string | The page on which the behavior occurs. |
| net_type | string | The network type. |
| event_time | bigint | The time when the behavior occurs. |
| item_id | bigint | The ID of the item. |
| event | string | The type of the behavior. |
| playtime | double | The playback duration or the reading duration. |
| ds | string | The partition column of the table. |
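Before you build the pipeline, you can preview the demo data by querying a single partition of each table. The following statements are only a sketch: the partition value 20231022 is a placeholder, so replace it with a ds value that actually exists in the demo tables.

```sql
-- Preview one partition of each demo table.
-- '20231022' is a placeholder; replace it with a ds value that exists in the tables.
SELECT * FROM pai_online_project.rec_sln_demo_user_table     WHERE ds = '20231022' LIMIT 10;
SELECT * FROM pai_online_project.rec_sln_demo_item_table     WHERE ds = '20231022' LIMIT 10;
SELECT * FROM pai_online_project.rec_sln_demo_behavior_table WHERE ds = '20231022' LIMIT 10;
```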
Procedure
Step 1: Go to Machine Learning Designer
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose Model Development and Training > Visualized Modeling (Designer).
Step 2: Build a pipeline
On the Visualized Modeling (Designer) page, click the Preset Templates tab.
In the Recommended Solution - Feature Engineering section of the template list page, click Create.
In the Create Pipeline dialog box, configure the parameters. You can use the default values.
The value specified for the Pipeline Data Path parameter is an Object Storage Service (OSS) bucket path that is used to store temporary data and models generated during the runtime of the pipeline.
Click OK.
It takes approximately 10 seconds to create the pipeline.
In the pipeline list, double-click Recommended Solution - Feature Engineering to enter the pipeline.
View the components of the pipeline on the canvas. The system automatically creates the pipeline based on the preset template. The following table describes the components.
| Component | Description |
| --- | --- |
| 1 | Preprocesses the item table. Replaces the tag feature delimiter with chr(29) for subsequent feature generation (FG), and generates a feature that indicates whether the item is new. |
| 2 | Preprocesses the behavior table. Outputs derived features of the behavior time, such as day_h and week_day. |
| 3 | Preprocesses the user table. Generates a feature that indicates whether the user is new, and replaces the tag feature delimiter with chr(29) for subsequent FG. |
| 4 | Joins the behavior table, user table, and item table to generate a wide behavior log table with statistical attributes. |
| 5 | Generates the item feature table, which contains the statistical features of items over a period of time. item__{event}_cnt_{N}d: the number of times the behavior occurred on the item within N days, which indicates the popularity of the item. item__{event}_{itemid}_dcnt_{N}d: the number of unique users who performed the behavior on the item within N days, which also indicates the popularity of the item. item__{min\|max\|avg\|sum}_{field}_{N}d: the statistical distribution of a numeric property of the users who performed positive behaviors on the item within N days, which describes the numeric properties of users that prefer the item. item__kv_{cate}_{event}_{N}d: the statistics of a categorical property of the users who performed the behavior on the item within N days, which describes the categorical properties of users that prefer the item. |
| 6 | Generates the user feature table, which contains the statistical features of users over a period of time. |
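To make the item features in component 5 more concrete, the following statement is a minimal sketch of how an item__click_cnt_7d-style count could be derived from the wide behavior log table. It assumes that the wide table produced by component 4 is named rec_sln_demo_behavior_table_preprocess_wide_v2, that the event column contains a value such as click, and that ${bizdate} stands for the partition being processed; the SQL that the component actually runs is generated by the template and may differ.

```sql
-- Sketch: 7-day click count and distinct-user count per item
-- (item__click_cnt_7d-style features). The table name, event value, and the
-- ${bizdate} placeholder are assumptions; the real component SQL may differ.
SELECT  item_id,
        COUNT(*)                AS item__click_cnt_7d,
        COUNT(DISTINCT user_id) AS item__click_user_dcnt_7d
FROM    rec_sln_demo_behavior_table_preprocess_wide_v2
WHERE   ds >  TO_CHAR(DATEADD(TO_DATE('${bizdate}', 'yyyymmdd'), -7, 'dd'), 'yyyymmdd')
  AND   ds <= '${bizdate}'
  AND   event = 'click'
GROUP BY item_id;
```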
Step 3: Add a function
Create a workflow. For more information, see Create a workflow.
Right-click MaxCompute below the workflow that you created and create a Python resource named count_cates_kvs.py. For more information, see Create and use MaxCompute resources.
Right-click MaxCompute below the workflow that you created and select Create Function to create a MaxCompute function named COUNT_CATES_KVS. Set Class Name to count_cates_kvs.CountCatesKVS and Resources to count_cates_kvs.py. For more information, see Create and use a MaxCompute function.
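Alternatively, after the count_cates_kvs.py resource has been uploaded to the MaxCompute project, you can register the function with a SQL statement instead of the console. This is only a sketch of the registration step, using the resource name and class name described above; the content of the script itself is provided by the solution.

```sql
-- Register the uploaded Python resource as a MaxCompute function.
-- Assumes count_cates_kvs.py already exists as a resource in the project.
CREATE FUNCTION COUNT_CATES_KVS AS 'count_cates_kvs.CountCatesKVS' USING 'count_cates_kvs.py';
```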
Step 4: Run the workflow and view the results
The dataset used in this example contains 45 days of data, so the pipeline takes a long time to run. If you want the pipeline to finish sooner, perform the following operations:
Change the execution time window parameters to use data from a shorter period of time.
Click each of the following components and, on the Parameters Setting tab, change the Execution time window parameter from (-45,0] to (-9,0]:
1_rec_sln_demo_item_table_preprocess_v2
2_rec_sln_demo_behavior_table_preprocess_v2
3_rec_sln_demo_user_table_preprocess_v2
4_rec_sln_demo_behavior_table_preprocess_wide_v2
Click each of the following components and, on the Parameters Setting tab, change the Execution time window parameter from (-31,0] to (-8,0]:
5_rec_sln_demo_item_table_preprocess_all_feature_v2
6_rec_sln_demo_user_table_preprocess_all_feature_v2
Modify the SQL scripts to use a smaller sample of users.
Click the 2_rec_sln_demo_behavior_table_preprocess_v2 component. On the Parameters Setting tab, change line 32 of the SQL Script parameter from WHERE ds = '${pai.system.cycledate}' to WHERE ds = '${pai.system.cycledate}' and user_id % 10 = 1.
Click the 3_rec_sln_demo_user_table_preprocess_v2 component. On the Parameters Setting tab, change line 38 of the SQL Script parameter from WHERE ds = '${pai.system.cycledate}' to WHERE ds = '${pai.system.cycledate}' and user_id % 10 = 1.
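The added condition samples the users: user_id % 10 = 1 keeps roughly one tenth of the user IDs. The following self-contained statement, which uses a placeholder partition value of 20231022, illustrates the effect of the filter on the demo behavior table.

```sql
-- Compare the full partition with the sampled subset.
-- user_id % 10 = 1 keeps roughly one tenth of the users.
-- '20231022' is a placeholder; replace it with a ds value that exists.
SELECT  COUNT(*)                                           AS all_rows,
        SUM(CASE WHEN user_id % 10 = 1 THEN 1 ELSE 0 END)  AS sampled_rows
FROM    pai_online_project.rec_sln_demo_behavior_table
WHERE   ds = '20231022';
```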
Click the Run button in the top toolbar of the pipeline canvas. After the pipeline finishes running, check whether the following MaxCompute tables contain data for 30 days:
Item feature table: rec_sln_demo_item_table_preprocess_all_feature_v2
Behavior log wide table: rec_sln_demo_behavior_table_preprocess_v2
User feature table: rec_sln_demo_user_table_preprocess_all_feature_v2
You can query data in the preceding tables by using SQL statements. For more information, see DataWorks.
Note: A full table scan is disabled for partitioned tables in the project to which the table belongs, so you must specify the partitions that you want to scan. If you want to scan a full table by using an SQL statement, add set odps.sql.allow.fullscan=true; before the SQL statement and commit the flag together with the statement. A full table scan increases the amount of data that is read and therefore the cost.
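For example, the following statements count the rows in each partition of the user feature table to verify that it contains data. Because the GROUP BY reads all partitions, the full-scan flag must be committed together with the query, which increases the amount of data that is scanned. The table name assumes the default output name shown in this topic.

```sql
-- Count rows per partition to verify that the feature table contains data.
-- The flag must be committed together with the query; a full scan increases cost.
set odps.sql.allow.fullscan=true;
SELECT   ds, COUNT(*) AS row_cnt
FROM     rec_sln_demo_user_table_preprocess_all_feature_v2
GROUP BY ds
ORDER BY ds
LIMIT    100;
```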