Configure DSSM Vector Recall in PAI Designer - Platform for AI - Alibaba Cloud - Platform For AI

This tutorial shows you how to build a Deep Structured Semantic Model (DSSM) two-tower vector recall pipeline using PAI's visual pipeline designer. By the end, you will have:

Created a DSSM recall pipeline from a preset template
Trained a model using negative sampling on 30-day behavioral data
Generated user and item vectors for recall
Evaluated recall quality using hit_rate@top200

Prerequisites

Before you begin, complete the feature engineering workflow to generate the following datasets:

rec_sln_demo_user_table_preprocess_all_feature_v2
rec_sln_demo_item_table_preprocess_all_feature_v2
rec_sln_demo_behavior_table_preprocess_v2

Open Machine Learning Designer

Log on to the Machine Learning Platform for AI (PAI) console.
In the left-side navigation pane, click Workspaces, then click the name of the workspace you want to use.
In the left-side navigation pane, choose Model Development and Training > Visualized Modeling (Designer).

Create the pipeline

On the Visualized Modeling (Designer) page, click the Preset Templates tab.
In the Recommended Solution - Vector Recall section, click Create.
In the Create Pipeline dialog box, configure the parameters. You can keep the default values.
The Pipeline Data Path parameter specifies an Object Storage Service (OSS) bucket path where the pipeline stores temporary data and model artifacts during runtime.
Click OK.
Pipeline creation takes about 10 seconds. When Recommended Solution - Vector Recall appears in the pipeline list, the pipeline is ready.
Double-click Recommended Solution - Vector Recall to open the pipeline on the canvas.

Understand the pipeline components

The preset template creates a complete DSSM recall pipeline. The components run in the following order:

Pipeline canvas showing components 1 through 18

Component	What it does
1	The sample model `DSSM_Recall` for vector recall
2	Applies feature generation (FG) to the sample model
3	Creates a positive sample table and uses it for negative sampling training
4	Applies equal frequency binning to numerical features to set the model's `boundaries` parameter
5	Counts unique values of enumeration features to set the model's `embedding_dim` and `hash_bucket_size` parameters
6	Applies FG to item features
7	Applies FG to user features
8	Aggregates results from `rec_sln_demo_dssm_recall_30d_binning_v1` and `rec_sln_demo_dssm_recall_30d_count_v1` to calculate feature configuration and step configuration
9	Creates an item table for negative sampling
10	Discretizes 30-day sample data from the DSSM_Recall model to generate training samples
11	Specifies the EasyRec configuration file based on the output of component 8
12	You need to run component 11 to generate the EasyRec configuration file before model training
13	Runs inference with the split item model on `rec_sln_demo_dssm_recall_item_feature_fg_encoded_v1` to produce item vectors
14	Runs inference with the split user model on `rec_sln_demo_dssm_recall_user_feature_fg_encoded_v1` to produce user vectors
15	Creates a sequence table and evaluates recall quality using hit_rate
18	Calculates the final evaluation metric: hit_rate@top200

Note

Component 12 requires component 11 to complete first, because the EasyRec configuration file must exist before model training starts.

Note

The hit_rate evaluation in components 15 and 18 excludes new users and new items that appear on the evaluation day, because there is no historical data to recall for them.

Run the pipeline and view results

In the top toolbar of the canvas, click Run.
The pipeline runs all components in dependency order. Monitor progress by watching each component's status on the canvas. When all components show a completed status, the pipeline has finished.
Right-click component 18 (18_rec_sln_demo_recall_total_hit_rate_v1_2) and choose View Data > hit_rate_detail to see recall hit rate by segment.
Right-click component 18 and choose View Data > total_hit_rate to see the overall recall hit rate.