Vector-based Recall Evaluation

The Vector-based Recall Evaluation component is used to calculate the hit rate of recalls. The hit rate can be used to evaluate the precision of recalls. A higher value indicates a higher precision of recalls that are performed by using the vectors generated during model training. This topic describes how the Vector-based Recall Evaluation component works and how to configure the component.

How it works

The Vector-based Recall Evaluation component supports u2i recalls and i2i recalls. During u2i recalls, user vectors are used to recall the top K items. During i2i recalls, item vectors are used to recall the top K items.

The hit rate is calculated as follows: For each trigger (a user vector for u2i recalls or an item vector for i2i recalls), let M be the collection of relevant items for that trigger. The top K items that are most similar to the trigger are recalled, and N is the set of recalled items that fall into the M collection. The top K hit rate is then |N|/|M|. The component also outputs the recalled items that do not fall into the M collection, together with their distance values, for bad case studies.

The Vector-based Recall Evaluation component can run in standalone mode or distributed mode. Procedure:

All workers load the user embedding table or item embedding table to build the indexes that are required for k-nearest neighbor (KNN) search.
The workers read the true sequence table in batches, look up the K nearest neighbors of each trigger in the index built from the embedding table, and return the top K items.
The component calculates the hit rate by comparing the sequence values of the items in the true sequence table with the sequence values of the top K items.
The component aggregates the results and outputs the results to MaxCompute tables.
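
The calculation described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not the component's implementation: it replaces the KNN index with a brute-force scan, uses inner products as the similarity measure, and returns 0.0 for an empty M collection (an assumption, because the source does not describe that case). The function names and sample values are hypothetical.

```python
def inner_product(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(trigger_vec, item_vecs, k):
    """Brute-force stand-in for the KNN index lookup (hypothetical helper)."""
    scored = [(inner_product(trigger_vec, vec), item_id)
              for item_id, vec in item_vecs.items()]
    # Greatest inner product first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


def hit_rate(recalled_ids, relevant_ids):
    """|N| / |M|, where M is the relevant set and N the recalled items in M."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: an empty M yields a hit rate of 0
    hits = [i for i in recalled_ids if i in relevant]
    return len(hits) / len(relevant)


# Hypothetical item embedding table with 2-dimensional vectors.
item_vecs = {
    101: [1.0, 0.0],
    102: [0.9, 0.1],
    103: [0.0, 1.0],
    104: [-1.0, 0.0],
}
user_vec = [1.0, 0.2]       # trigger for a u2i recall
relevant = [101, 103, 105]  # M: ground truth for this user

recalled = top_k_by_inner_product(user_vec, item_vecs, k=2)
rate = hit_rate(recalled, relevant)
print(recalled, rate)
```

In this sketch, items 101 and 102 are recalled and only item 101 falls into M, so the hit rate for this trigger is 1/3.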

Input

Item embedding table

An embedding table that stores item vectors. An item embedding table is usually generated by a training algorithm such as GraphSAGE. Example:

item id (bigint)	item embeddings (string)
23456677	0.1,0.2,0.3....

User embedding table

An embedding table that stores user vectors. A user embedding table is usually generated by a training algorithm such as GraphSAGE. Example:

user id (bigint)	user embeddings (string)
12345	0.1,0.2,0.3....

True sequence table

A table that stores triggers and their relevant items. The table is used as the ground truth. For u2i recalls, the trigger id column is mapped to the user id column. For i2i recalls, the trigger id column is mapped to the item id column. In the following true sequence table, the trigger id column is mapped to the item id column.

trigger id (bigint)	item ids (string)
12345	23456677,2233445,6837292,...
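
The embedding and item ID columns in the input tables are comma-separated strings, so they must be split before any vector calculation. The following sketch shows one way to parse them in Python. The helper names are hypothetical, the dimension check is an assumption about how a mismatch could be detected, and the sample rows are shortened stand-ins for the truncated examples above.

```python
def parse_embedding(text, emb_dim=None):
    """Parse a comma-separated embedding string into a list of floats.

    If emb_dim is given, verify that the vector has that many components
    (hypothetical validation, not documented behavior of the component).
    """
    values = [float(x) for x in text.split(",")]
    if emb_dim is not None and len(values) != emb_dim:
        raise ValueError(
            "expected %d dimensions, got %d" % (emb_dim, len(values)))
    return values


def parse_item_ids(text):
    """Parse a comma-separated item ID string into a list of integers."""
    return [int(x) for x in text.split(",")]


# Shortened sample rows; real embeddings have emb_dim components.
vec = parse_embedding("0.1,0.2,0.3", emb_dim=3)
ids = parse_item_ids("23456677,2233445,6837292")
print(vec, ids)
```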

Output

total_hitrate table

A table that stores the total hit rate values. Example:

hitrate (double)
0.4

hitrate_details table

A table that stores hit rate details. The hitrate_details table contains the same number of rows as the true sequence table. Example:

id (bigint)	topk_ids (string)	topk_dists (string)	hitrate (double)	bad_ids (string)	bad_dists (string)
1123	2345,2367,2483,2567	0.8,0.7,0.2,0.1	0.39	2483,2567	0.2,0.1

Column descriptions:

For u2i recalls, the id column is mapped to the user_id column. For i2i recalls, the id column is mapped to the item_id column.
The topk_ids column displays the IDs of the top K items that are recalled for the trigger. The item IDs are separated by commas (,).
The topk_dists column displays the distance between the trigger and each item in the topk_ids column, in the same order.
The hitrate column displays the hit rate of the recalled items corresponding to each trigger.
The bad_ids column displays the items that are recalled but are irrelevant.
The bad_dists column displays the distance between the trigger and each item in the bad_ids column, in the same order.
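
To show how the columns relate to each other, the following Python sketch derives one hitrate_details row from a recalled top K list and a ground-truth set. It is an illustration under stated assumptions, not the component's code: `build_detail_row` is a hypothetical helper, the recalled IDs and distances are taken from the example row above, and the ground-truth IDs 9001, 9002, and 9003 are made up, so the resulting hit rate differs from the value in the example row.

```python
def build_detail_row(trigger_id, topk_ids, topk_dists, relevant_ids):
    """Assemble one hitrate_details row (hypothetical helper).

    Assumes topk_ids contains no duplicates and that topk_dists is
    aligned with topk_ids position by position.
    """
    relevant = set(relevant_ids)
    # Recalled items that are not in the ground truth are the bad cases.
    bad = [(i, d) for i, d in zip(topk_ids, topk_dists) if i not in relevant]
    hits = len(topk_ids) - len(bad)
    return {
        "id": trigger_id,
        "topk_ids": ",".join(str(i) for i in topk_ids),
        "topk_dists": ",".join(str(d) for d in topk_dists),
        "hitrate": hits / len(relevant) if relevant else 0.0,
        "bad_ids": ",".join(str(i) for i, _ in bad),
        "bad_dists": ",".join(str(d) for _, d in bad),
    }


row = build_detail_row(
    trigger_id=1123,
    topk_ids=[2345, 2367, 2483, 2567],
    topk_dists=[0.8, 0.7, 0.2, 0.1],
    relevant_ids=[2345, 2367, 9001, 9002, 9003],  # made-up ground truth
)
print(row)
```

With this made-up ground truth, two of the four recalled items are hits and |M| is 5, so the sketch reports a hit rate of 0.4 and lists items 2483 and 2567 as bad cases.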

Component parameters

You can configure the Vector-based Recall Evaluation component in the console or by using the CLI. The parameters in the console and at the CLI are the same. The following table describes the parameters.

Parameter		Type	Description
Input	item_emb_table (Item vector table)	string	The item embedding table.
	true_seq_table (True sequence table)	string	The true sequence table. For u2i recalls, the table contains users and user-relevant items. For i2i recalls, the table contains items and item-relevant items. Important: When you evaluate the precision of recalls, if the data in the embedding table is collected at time T, the data in the true sequence table must be collected at time T+1. Otherwise, the calculated hit rate will be higher than expected.
	user_emb_table (User vector table)	string (Optional)	The user embedding table. This table is required only for u2i recalls.
Output	total_hitrate (Vector-based recall hit rate)	string	An output table that contains the total hit rate values.
	hitrate_details (Vector-based recall hit rate details)	string	An output table that contains the hit rate details.
Parameters	recall_type (Recall type)	string	The type of recall, which can be 'u2i' or 'i2i'.
	emb_dim (Vector table feature dimension)	int	The embedding dimension of the embedding table.
	k (Number of recalled items)	int	The number of recalled items.
	metric (Similarity measurement)	int (Optional. Default value: 1)	The method that is used to measure the similarity. If you set the parameter to 0, L2 distance is used to measure the similarity. If you set the parameter to 1, inner products are used to measure the similarity. If L2 distance is used, top K items with the shortest distance are returned. If inner products are used, top K items with the greatest inner products are returned.
	strict (Whether to enable the strict mode)	bool (Optional. Default value: False)	Deviations exist in similarity calculation. If you set the strict parameter to True, the system compares the similarity of vectors in strict mode. However, similarity calculation in this mode is time-consuming.
	lifecycle	int (Optional. Default value: 7)	The lifecycle of the output tables. Unit: days.
Tuning	batch_size	int (Optional. Default value: 1024)	The number of samples per batch. Set the parameter to a small value if memory resources are limited.
	worker_count (Compute Cores)	int (Optional. Default value: 1)	The number of workers that are used to run the evaluation. Set this parameter to a large value when the input tables are large or one worker cannot complete the calculation efficiently.
	worker_memory (Memory Per core)	int (Optional. Default value: 20000)	The amount of memory allocated to each worker. Unit: MB.
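
The two values of the metric parameter rank candidates in opposite directions, which the following Python sketch illustrates. It is a brute-force illustration with hypothetical names and sample vectors, and it assumes plain Euclidean distance for metric 0; the component's internal distance computation is not documented here and may differ (for example, by using squared distances).

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_top_k(query, candidates, k, metric=1):
    """Return the top K (item_id, score) pairs for the given metric.

    metric=0: L2 distance, shortest distance first.
    metric=1: inner product, greatest inner product first.
    """
    if metric == 0:
        scored = sorted((l2_distance(query, v), i)
                        for i, v in candidates.items())
    elif metric == 1:
        scored = sorted(((inner_product(query, v), i)
                         for i, v in candidates.items()), reverse=True)
    else:
        raise ValueError("metric must be 0 (L2) or 1 (inner product)")
    return [(i, s) for s, i in scored[:k]]


# Hypothetical candidates: item 2 points the same way as the query but is longer.
candidates = {1: [1.0, 0.0], 2: [3.0, 0.0], 3: [0.0, 1.0]}
query = [1.0, 0.0]

by_l2 = rank_top_k(query, candidates, k=2, metric=0)
by_ip = rank_top_k(query, candidates, k=2, metric=1)
print(by_l2, by_ip)
```

In this sketch the two metrics recall different items for the same query: L2 distance returns items 1 and 3, whereas inner products return items 2 and 1, because a long vector in the same direction has a large inner product but is not close in Euclidean terms.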

Sample command

pai -name hitrate_gl_ext
    -Ditem_emb_table='item_emb_table'
    -Duser_emb_table='user_emb_table'
    -Dtrue_seq_table='true_seq_table'
    -Dhitrate_details='hitrate_details'
    -Dtotal_hitrate='total_hitrate'
    -Drecall_type='u2i'
    -Dk=5
    -Demb_dim=10
    -Dmetric=1
    -Dstrict=False
    -Dbatch_size=1024
    -Dworker_count=1
    -Dworker_memory=20000
    -Dlifecycle=7;

The preceding example shows how to calculate the hit rate of u2i recalls. The top 5 items are recalled for each trigger based on 10-dimensional vectors. Inner products are used to measure the similarity of vectors, the strict mode is disabled, and the calculation is performed in batches. One worker processes 1,024 samples from the true_seq_table table in each batch. The worker is allocated 20,000 MB of memory, and the lifecycle of the hitrate_details and total_hitrate output tables is seven days.