Deep recall model based on multi-task learning and negative feedback

One: Background  
A traditional recommendation system usually consists of two stages: Candidate Generation and Ranking. Taking the classic YouTube video recommendation system in the figure below as an example [1], the whole system is divided into two layers: the first layer is Candidate Generation, which quickly screens hundreds of candidate videos out of the full video corpus; this step is usually called Matching (recall). The second layer is Ranking, which scores those hundreds of videos precisely and reorders them to determine the final order of results shown to the user.

This article focuses on the Matching (recall) stage. This stage usually faces the entire candidate item set and must retain as many highly relevant results as possible while remaining fast, so it has an important influence on the final recommendation quality. A series of recent practices and studies have shown that deep learning recommendation models based on behavior sequences [2-4], combined with high-performance approximate nearest neighbor retrieval algorithms [5], can achieve recall that is both accurate and fast (this scheme is usually called DeepMatch). Compared with traditional recall methods (such as swing, etrec, SVD), DeepMatch has the following advantages:

It can model deeper, non-linear relationships between users and items
It can incorporate various user and item features into the model
Behavior-sequence-based models can capture users' changing interests and integrate their long-term and short-term interests
DeepMatch has been widely used in Tmall Genie recommendation scenarios (APP information-flow recommendation, music recommendation, etc.) and has achieved better results than traditional i2i methods. However, the current model still has some problems, and some key signals have not yet been introduced (take the "I want to listen to music" scenario as an example, in which Tmall Genie recommends music to users according to their preferences):

Negative feedback signal (Play Rate)

The initial training log data contain only positive feedback, that is, the DeepMatch model is trained on sequences of songs with a high playback completion rate. In the Tmall Genie scenario, users actively cut off songs, for example by saying "stop playing" or "next song", and most of these actions are triggered when the user dislikes the song. These signals can be incorporated into the model as negative feedback from the user, and several studies have demonstrated the value of negative feedback [6-7]. If these signals are used effectively, the model can capture the user's changing interests in real time and recommend fewer songs of the same type after the user skips a song. In this scenario, we use the playback completion rate of each song to represent user feedback: a high completion rate is positive feedback, and a low completion rate is negative feedback.

Song on demand query intent signal (Intent Type)

Most of the songs on Tmall Genie are played in response to users' query requests, so behind each song there is a user intent. Tmall Genie has a dedicated song-query intent analysis module, covering, for example, precise on-demand requests (by song name: "Play Qilixiang"; by singer: "I want to listen to Andy Lau's songs") and recommendation requests (by style or genre: "Let's rock and roll"; casual listening: "Sing a song"). Analysis of user behavior shows that songs played under different intent types contribute different weights to the recommendation model, so integrating the intent type associated with each song into the model allows it to grasp the user's interests more accurately. Therefore, this article proposes a deep recall model based on multi-task learning and negative feedback.

Two: Method
In general, due to the constraints of approximate nearest neighbor retrieval, our recall model needs to encode the user's historical behavior sequence independently to generate a user vector at each step, which is then combined with the Target Item vector by an inner product to obtain a score. The model is implemented on top of the Self-Attention architecture, and its overall structure is as follows:

1 Input Representations
As mentioned above, in order to model negative feedback and user intent type signals, our model introduces representations for the Play Rate and the Intent Type. The initial data set does not contain these two signals, so we use Train Set 1 to denote the initial data set and Train Set 2 to denote the data set that carries the negative feedback and intent type signals, and we give them a unified representation. Overall, the representation of each Item in the user's behavior sequence consists of the following four parts:

1) Item Embedding

We first embed each Item into a fixed-size low-dimensional vector. Train Set 1 and Train Set 2 share the same Item vocabulary, so they do not need to be distinguished:

Here x_i is the one-hot representation of the Item and E is the Item Embedding matrix. Note also that the output layer and the input layer share the same set of Item Embeddings. This is done to save GPU memory, because the number of items in a recommendation scenario is huge, and plenty of prior work has shown that this sharing has little impact on model performance [2].
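As a concrete illustration, here is a minimal TensorFlow 1.x-style sketch of a single Item Embedding table shared between the input lookup and the output scoring; the variable names and sizes are illustrative, not the production configuration:

```python
import tensorflow as tf  # TF 1.x style API

VOCAB_SIZE = 1000000   # illustrative number of items
EMB_SIZE = 64          # illustrative embedding size

# One shared table: used both to embed input items and as the output weights.
item_emb_table = tf.get_variable(
    "item_embedding", [VOCAB_SIZE, EMB_SIZE],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

item_ids = tf.placeholder(tf.int32, [None, None])           # [batch, seq_len]
seq_emb = tf.nn.embedding_lookup(item_emb_table, item_ids)   # [batch, seq_len, EMB_SIZE]

# At the output side, the same table scores items against the user vector,
# so no second item matrix has to be stored in GPU memory.
user_vec = tf.placeholder(tf.float32, [None, EMB_SIZE])      # encoded user vector
all_item_logits = tf.matmul(user_vec, item_emb_table, transpose_b=True)
```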

2) Position Embedding

For the behavior-sequence task we need position embeddings to capture the order information in the sequence. Instead of the sin/cos encoding used in the original Transformer paper [8], we directly learn an embedding for each of the Max Sequence Length positions.
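A small sketch of such learned position embeddings (sizes and names are illustrative):

```python
import tensorflow as tf

MAX_SEQ_LEN = 50     # illustrative maximum sequence length
HIDDEN_SIZE = 64

# One trainable vector per position, instead of the fixed sin/cos encoding.
pos_emb_table = tf.get_variable("position_embedding",
                                [MAX_SEQ_LEN, HIDDEN_SIZE])

item_emb = tf.placeholder(tf.float32, [None, MAX_SEQ_LEN, HIDDEN_SIZE])
positions = tf.range(MAX_SEQ_LEN)                            # 0, 1, ..., MAX_SEQ_LEN-1
pos_emb = tf.nn.embedding_lookup(pos_emb_table, positions)    # [MAX_SEQ_LEN, HIDDEN_SIZE]

# Broadcast-add position information onto every sequence in the batch.
seq_input = item_emb + tf.expand_dims(pos_emb, 0)
```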

3) Play Rate Embedding

The playback completion rate is important feedback on how well the user accepts an item. On Tmall Genie, users usually cut off songs they dislike directly with the "next" command, so the completion rate of these songs is relatively low. The completion rate is a continuous value in [0, 1]. To let this continuous feature interact with the discrete features mentioned later, we follow the scheme in [9] and map the Play Rate into the same low-dimensional vector space as the Item Embedding. Specifically, the Play Rate Embedding is expressed as:

Here p is the Play Rate, w is a randomly initialized embedding, and the result is the final Play Rate Embedding. Since the Train Set 1 data only contain songs with a long playback duration and carry no completion-rate information, we fix all Play Rates in Train Set 1 to 0.99.
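The following sketch assumes the scaling scheme described above, in which the scalar Play Rate scales a learned vector into the item space; the exact mapping in the production model follows [9] and may differ in detail:

```python
import tensorflow as tf

HIDDEN_SIZE = 64

# A single learned direction; the scalar play rate scales it into the item space.
play_rate_w = tf.get_variable("play_rate_embedding", [HIDDEN_SIZE])

play_rate = tf.placeholder(tf.float32, [None, None])              # [batch, seq_len], values in [0, 1]
play_rate_emb = tf.expand_dims(play_rate, -1) * play_rate_w        # [batch, seq_len, HIDDEN_SIZE]

# Train Set 1 carries no completion-rate signal, so its play rate is fixed to 0.99.
batch, seq_len = 32, 50
train_set1_play_rate = tf.fill([batch, seq_len], 0.99)
```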

4) Intent Type Embedding

The user intent type indicates how the user arrived at the item. On Tmall Genie, for example, on-demand (songs the user explicitly requests) and recommendation (songs Tmall Genie recommends for the user) are two different Intent Types (there are more types in the actual Tmall Genie scenario). Similar to the representation of the Item itself, we map the Intent Type into a fixed low-dimensional vector space:

The Intent Type is unknown for the Train Set 1 data, so we assume that all Intent Types in Train Set 1 are on-demand.
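A minimal sketch of the Intent Type embedding and the Train Set 1 default; the intent vocabulary size and the id chosen for "on-demand" are hypothetical:

```python
import tensorflow as tf

NUM_INTENT_TYPES = 8   # illustrative: on-demand, recommendation, ...
HIDDEN_SIZE = 64
ON_DEMAND_ID = 0       # hypothetical id for the "on-demand" intent

intent_emb_table = tf.get_variable("intent_type_embedding",
                                   [NUM_INTENT_TYPES, HIDDEN_SIZE])

intent_ids = tf.placeholder(tf.int32, [None, None])                # [batch, seq_len]
intent_emb = tf.nn.embedding_lookup(intent_emb_table, intent_ids)   # [batch, seq_len, HIDDEN_SIZE]

# Train Set 1 carries no intent signal, so every position is treated as on-demand.
batch, seq_len = 32, 50
train_set1_intent_ids = tf.fill([batch, seq_len], ON_DEMAND_ID)
```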

2 Factorized Embedding Parameterization
In recommendation tasks the vocabulary size is usually huge, which prevents us from using a large Embedding Size for the Items, otherwise the embedding table cannot be stored. However, much work with Transformers shows that increasing the Hidden Size effectively improves the model [10]. Following the compression approach of ALBERT [11], we first map the one-hot vector of the Item into a low-dimensional space of size E and then project it back up to the Hidden Size H before feeding it into the Transformer. The number of embedding parameters thus drops from O(V × H) to O(V × E + E × H), a significant reduction when H is much larger than E.
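A sketch of this factorized parameterization with illustrative sizes V, E and H:

```python
import tensorflow as tf

VOCAB_SIZE = 1000000   # V: number of items (illustrative)
EMB_SIZE = 32          # E: small factorized embedding size
HIDDEN_SIZE = 256      # H: Transformer hidden size

# A V x E table plus an E x H projection instead of a single V x H table.
item_emb_table = tf.get_variable("item_embedding", [VOCAB_SIZE, EMB_SIZE])
emb_projection = tf.get_variable("embedding_projection", [EMB_SIZE, HIDDEN_SIZE])

item_ids = tf.placeholder(tf.int32, [None, None])                   # [batch, seq_len]
low_dim = tf.nn.embedding_lookup(item_emb_table, item_ids)           # [batch, seq_len, E]
transformer_input = tf.tensordot(low_dim, emb_projection, axes=1)    # [batch, seq_len, H]

# Parameter count: V*E + E*H instead of V*H -- a large saving when H >> E.
```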

3 Feedback-Aware Multi-Head Self-Attention
After obtaining the feature representation of the user's behavior sequence, we construct the user vector by encoding the sequence with an attention mechanism. As described in the Transformer paper [8], attention can be seen as a function that maps a Query and a set of Key-Value pairs to an Output, where Query, Keys, Values and Output are all vectors; the Output is a weighted sum of the Values, and the weight assigned to each Value is computed by a scoring function of the Query and the corresponding Key. Because of the constraints of the recall model, we cannot let the Target Item interact with the Items in the user's behavior sequence in advance, so we use Self-Attention, in which the user behavior sequence itself serves as Query, Key and Value. Specifically, we use Multi-Head Self-Attention to improve the model's ability to capture complex interactions:

Here the W matrices are the linear projection matrices and h is the number of heads. In the classic Transformer structure, the input consists of the Item Embedding plus the Position Embedding. In our model we additionally feed the external information (playback completion rate and user intent type) into the attention input, which we call Feedback-Aware Attention. Through this information the model can combine the user's feedback and perceive the user's different preference for each song:
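The following is a compact, self-contained sketch of the multi-head self-attention step with a padding mask; the head count, layer names, and the simple additive fusion of the four embeddings are assumptions for illustration, not the exact production code:

```python
import tensorflow as tf

HIDDEN_SIZE = 64

def multi_head_self_attention(x, mask, num_heads=4):
    """x: [batch, seq_len, hidden]; mask: [batch, seq_len], 1 for real items, 0 for padding."""
    hidden = x.get_shape().as_list()[-1]
    head_dim = hidden // num_heads

    # Linear projections for Query, Key and Value.
    q = tf.layers.dense(x, hidden, name="q_proj")
    k = tf.layers.dense(x, hidden, name="k_proj")
    v = tf.layers.dense(x, hidden, name="v_proj")

    def split_heads(t):  # [batch, seq, hidden] -> [batch, heads, seq, head_dim]
        t = tf.reshape(t, [tf.shape(t)[0], tf.shape(t)[1], num_heads, head_dim])
        return tf.transpose(t, [0, 2, 1, 3])

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    logits = tf.matmul(q, k, transpose_b=True) / (head_dim ** 0.5)   # [b, heads, seq, seq]

    # Mask out padded positions so they receive (near-)zero attention weight.
    attn_mask = mask[:, None, None, :]                                # [batch, 1, 1, seq]
    logits += (1.0 - attn_mask) * (-1e9)
    weights = tf.nn.softmax(logits)

    out = tf.matmul(weights, v)                                       # [b, heads, seq, head_dim]
    out = tf.transpose(out, [0, 2, 1, 3])
    out = tf.reshape(out, [tf.shape(x)[0], tf.shape(x)[1], hidden])
    return tf.layers.dense(out, hidden, name="out_proj")

# Feedback-aware input: item + position + play-rate + intent-type embeddings
# (summed here; the exact fusion used in the production model is an assumption).
seq_input = tf.placeholder(tf.float32, [None, None, HIDDEN_SIZE])
mask = tf.placeholder(tf.float32, [None, None])
user_states = multi_head_self_attention(seq_input, mask)             # one user vector per step
```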

In addition, following ALBERT [11], we also tried the Cross-layer Parameter Sharing scheme, in which all Transformer layers share one set of parameters. Compared with a Transformer of the same size (the same number of layers), this scheme slightly reduces accuracy, but it greatly reduces the number of parameters and makes training much faster. For scenarios with a relatively large amount of data, this strategy saves training time while preserving model performance as much as possible.
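A minimal sketch of cross-layer parameter sharing via variable reuse; the layer body here is only a stand-in for the real self-attention plus feed-forward block:

```python
import tensorflow as tf

NUM_LAYERS = 4
HIDDEN_SIZE = 64

def transformer_layer(x):
    # Stand-in for the real self-attention + feed-forward block.
    return tf.layers.dense(x, HIDDEN_SIZE, activation=tf.nn.relu, name="ffn")

x = tf.placeholder(tf.float32, [None, None, HIDDEN_SIZE])
out = x
for _ in range(NUM_LAYERS):
    # AUTO_REUSE makes every iteration look up the same variables,
    # so all layers share one set of parameters.
    with tf.variable_scope("shared_layer", reuse=tf.AUTO_REUSE):
        out = transformer_layer(out)
```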

4 Sampled Softmax Loss For Positive Feedback and Sigmoid CE Loss For Negative Feedback
After obtaining the user's vector representation, the multi-task learning objective can be defined. Since we have unified the data representations of Train Set 1 and Train Set 2, the two tasks can be trained jointly. Specifically, our task is similar to a language model: given the music the user has listened to so far, predict which music the user wants to listen to at the next step. We divide the signals into two types according to the playback completion rate (the actual implementation uses a completion-rate threshold): one is Positive Feedback, whose optimization goal is to rank the item's score as high as possible, and the other is Negative Feedback, whose optimization goal is to rank the item's score as low as possible.

As shown in the left figure below, if we only use the traditional Positive Feedback loss, the model can pull the user vector closer to liked Items (positive samples) and push it away from unobserved Items, but it may not push it away from disliked Items (negative samples). We hope the model keeps these properties while also increasing the distance between the user vector and disliked Items, as shown in the figure on the right:

Therefore, we use the optimization goal of combining Positive Feedback and Negative Feedback:

For Positive Feedback, we optimize with Sampled Softmax Loss, which much previous work [12-13] has shown to be well suited to large-scale recall models. It is defined as follows:
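A minimal sketch of this objective using TensorFlow's built-in tf.nn.sampled_softmax_loss (sizes and the number of negative samples are illustrative); by default it uses the log-uniform sampler discussed below:

```python
import tensorflow as tf

VOCAB_SIZE = 1000000
HIDDEN_SIZE = 64
NUM_SAMPLED = 2000   # number of negative samples; a fairly important hyper-parameter

item_emb_table = tf.get_variable("item_embedding", [VOCAB_SIZE, HIDDEN_SIZE])
item_bias = tf.get_variable("item_bias", [VOCAB_SIZE],
                            initializer=tf.zeros_initializer())

user_vec = tf.placeholder(tf.float32, [None, HIDDEN_SIZE])   # encoded user vector
pos_item = tf.placeholder(tf.int64, [None, 1])                # positive-feedback target item

pos_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=item_emb_table,   # shared output embedding table
    biases=item_bias,
    labels=pos_item,
    inputs=user_vec,
    num_sampled=NUM_SAMPLED,
    num_classes=VOCAB_SIZE))
```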

When implementing this in TensorFlow, the default sampler used by sampled_softmax is log_uniform_candidate_sampler (the TF version of Word2Vec is built on it), which samples from a Zipfian distribution: the probability that an item is sampled depends only on the rank of its frequency, not on the frequency itself. However, items in recommendation may not behave like words in NLP, so we also tried two other samplers:

uniform_candidate_sampler: samples from a uniform distribution
learned_unigram_candidate_sampler: dynamically counts the vocabulary during training, then samples according to the estimated unigram probabilities
In our experiments, learned_unigram_candidate_sampler achieved the best results. In addition, similar to [2], we found that the number of negative samples is a fairly important parameter; when GPU memory allows, appropriately increasing it effectively improves the convergence of the model.
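For reference, an alternative sampler can be plugged in through the sampled_values argument of tf.nn.sampled_softmax_loss; a sketch with illustrative sizes:

```python
import tensorflow as tf

VOCAB_SIZE = 1000000
NUM_SAMPLED = 2000

pos_item = tf.placeholder(tf.int64, [None, 1])   # [batch, 1] true classes

# Learns item frequencies during training and samples from that unigram distribution.
sampled_values = tf.nn.learned_unigram_candidate_sampler(
    true_classes=pos_item,
    num_true=1,
    num_sampled=NUM_SAMPLED,
    unique=True,
    range_max=VOCAB_SIZE)

# Pass this to tf.nn.sampled_softmax_loss(..., sampled_values=sampled_values)
# instead of relying on the default log_uniform_candidate_sampler.
```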

For Negative Feedback, we directly optimize a Sigmoid Cross Entropy Loss, treating all items with a low playback completion rate as negative examples whose scores should be as low as possible:

The final total Loss is the sum of the two:
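A sketch of the negative-feedback term and the combined objective, assuming the score is the inner product between the user vector and the item embedding as described above (pos_loss stands for the sampled-softmax term from the previous sketch):

```python
import tensorflow as tf

HIDDEN_SIZE = 64

user_vec = tf.placeholder(tf.float32, [None, HIDDEN_SIZE])       # encoded user vector
neg_item_emb = tf.placeholder(tf.float32, [None, HIDDEN_SIZE])   # embedding of a skipped song
pos_loss = tf.placeholder(tf.float32, [])                         # sampled-softmax term from above

# Inner-product score for the negative-feedback item, pushed towards a low value.
neg_logit = tf.reduce_sum(user_vec * neg_item_emb, axis=-1)
neg_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.zeros_like(neg_logit), logits=neg_logit))

# The final objective is simply the sum of the two tasks.
total_loss = pos_loss + neg_loss
```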

Three: Experiments
1 Distributed training
The vocabulary size and the amount of data in recommendation scenarios are usually huge, so we implemented the distributed training code with TensorFlow's ParameterServer strategy. We also accumulated some experience during tuning:

1) Pay attention to embedding partitioning: one setting is the number of shards in tf.fixed_size_partitioner, the other is the partition_strategy in embedding_lookup; both must be kept identical between training and inference (see the sketch after this list).

2) The sampled_softmax code may need to be adapted to your own scenario. For example, in our scenario the proportion of repeated sample items within one batch is relatively high, and deduplicating the ids (unique) before the internal embedding_lookup greatly speeds up training (also shown in the sketch below).

3) Using the Mask mechanism flexibly can effectively implement various Attention strategies.
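A combined sketch of tips 1) and 2): keeping the partitioner and partition_strategy consistent, and deduplicating repeated ids before the lookup (shard count and sizes are illustrative):

```python
import tensorflow as tf

VOCAB_SIZE = 1000000
HIDDEN_SIZE = 64
NUM_SHARDS = 4   # must match between the training and inference graphs

# 1) Keep the partitioner and the partition_strategy consistent.
with tf.variable_scope("embeddings",
                       partitioner=tf.fixed_size_partitioner(NUM_SHARDS)):
    item_emb_table = tf.get_variable("item_embedding",
                                     [VOCAB_SIZE, HIDDEN_SIZE])

item_ids = tf.placeholder(tf.int32, [None])   # item ids inside one batch, many repeated

def lookup(table, ids):
    # The same strategy string must be used wherever this table is looked up.
    return tf.nn.embedding_lookup(table, ids, partition_strategy="div")

# 2) Deduplicate repeated ids before the lookup, then gather the rows back.
unique_ids, remap_idx = tf.unique(item_ids)
unique_emb = lookup(item_emb_table, unique_ids)   # one lookup per distinct id
item_emb = tf.gather(unique_emb, remap_idx)        # restore the original order
```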

2 Experimental results
The offline experiments use R@N as the metric, i.e. the proportion of cases in which the Target Item appears among the top N scored results. For positive samples the metric is called POS@N, and we want it to be as high as possible; for negative samples it is called NEG@N, and we want it to be as low as possible. Due to space limitations we only list one group of main experimental results, which verifies the effect of Multitask Learning and Negative Feedback; all other strategies use the best settings found above (optimizer, sampling method, etc.):

a. The traditional DM method, trained only on positive-feedback items with a high completion rate to construct the behavior sequence.
b. Add the Play Rate and Intent Type features on top of a.
c. Add the negative feedback signal on top of b and train with multitask learning:

It can be seen that after adding the feedback signal (Play Rate) and the song-on-demand query intent signal (Intent Type), b outperforms a; and after adding the Negative Feedback objective for Multitask Learning, c significantly reduces NEG@N while keeping POS@N almost unchanged.
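For clarity, a small numpy sketch of how the offline metric can be computed (the scoring matrix and targets are placeholders):

```python
import numpy as np

def recall_at_n(scores, target_ids, n):
    """scores: [num_cases, num_items]; target_ids: [num_cases]; returns R@N."""
    top_n = np.argsort(-scores, axis=1)[:, :n]           # indices of the N highest-scoring items
    hits = [t in row for t, row in zip(target_ids, top_n)]
    return float(np.mean(hits))

# Applied to positive targets this is POS@N (should be high);
# applied to negative-feedback targets it is NEG@N (should be low).
```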

We also verified the effectiveness of the new method online in the "Guess You Like" scenario. Compared with the bucket running the original DM scheme, the bucket running the new model (DM with Play Rate/Intent Type and NEG MTL) increased per-capita playing time by 9.2%. The method has since been applied to more Tmall Genie recommendation scenarios and achieved good results.
