Deep Analysis and Improvement of Cold Start Recommendation Model DropoutNet

Introduction

This article provides a deep analysis of the cold start recommendation model DropoutNet and proposes several improvements to it.

Why Do We Need Cold Start?

Recommender systems usually generate recommendation candidate sets through collaborative filtering, matrix factorization, or deep learning models, and these recall algorithms generally rely on the user-item behavior matrix. In a real recommender system, a steady stream of new users and new items keeps joining. Because they lack rich historical interaction data, these newly added users and items often cannot receive accurate recommendations, or cannot be accurately recommended to the right users. This is known as the cold start problem. Cold start is a challenge for recommender systems because existing algorithms, whether for recall, pre-ranking, or ranking, are unfriendly to new users and new items: they rely heavily on the behavior data collected by the system, while new users and new items have very little such data. As a result, new items get few opportunities to be displayed, and the interests of new users cannot be accurately modeled.

For some businesses, recommending new items in a timely manner and giving them sufficient exposure is vital to the platform's ecosystem and long-term benefits. For example, news is highly time-sensitive, and if it does not get displayed in time its value drops sharply. If a UGC platform cannot give newly published content enough timely exposure, it dampens creators' enthusiasm and reduces the amount of high-quality content the platform will receive in the future. If a dating platform cannot get enough attention for newly joined users, new users may stop joining and the platform will lose its vitality.

In summary, the cold start problem is very important in recommender systems. So how can it be solved?

How to Solve the Cold Start Problem

The algorithms (or strategies) for solving the cold start problem of recommender systems can be summarized by a four-character formula: "generalize, fast, transfer, few".

Generalize: push a new item up to a broader concept along its attributes or topics. For example, a new product can be recommended to users who liked the same category in the past, i.e., pushed up from "product" to "category"; a new short video can be recommended to users who follow its author, i.e., pushed up from "short video" to "author"; a newly published news article can be recommended to users who like the same topic, such as recommending an article introducing the "J-20" to a military fan, i.e., pushed up from "news article" to "topic". Essentially, this is content-based recommendation. Of course, for a better recommendation effect, we sometimes need to push up to several different "superordinate concepts" at the same time. For example, besides pushing a new product up to "category", we can also push up to "brand", "shop", "style", "color", and so on. Sometimes the superordinate concept is inherent to the new item, which is a relatively simple case; for example, the attributes of a product are usually filled in by the merchant when the product is published. Other concepts are not inherently present, such as an article's topic ("military", "sports", "beauty", etc.), which needs to be mined by a separate algorithm.
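As a toy illustration of this push-up idea, the following sketch (pure Python, with hypothetical item attributes, user lists, and function names) recalls candidate users for a brand-new item through its superordinate concepts:

```python
# Hypothetical catalog data: a brand-new item carries its superordinate concepts.
item_attrs = {
    'new_phone_x': {'category': 'smartphone', 'brand': 'AcmePhone'},
}

# Users with past interest in each superordinate concept (mined from history).
users_by_concept = {
    ('category', 'smartphone'): ['u1', 'u3'],
    ('brand', 'AcmePhone'): ['u2'],
}

def candidates_for_new_item(item_id):
    """Push a brand-new item up to its concepts and collect users
    who liked those concepts before (content-based recall)."""
    users = []
    for concept, value in item_attrs[item_id].items():
        users.extend(users_by_concept.get((concept, value), []))
    return sorted(set(users))

assert candidates_for_new_item('new_phone_x') == ['u1', 'u2', 'u3']
```

Pushing up to several concepts at once, as above, widens the candidate pool for an item that has zero interaction history.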

In addition to generalizing on tags or topics, another very common method is to obtain embedding vectors for users and items with some algorithm and then use the distance/similarity of the vectors to match user interests to items. Algorithms such as matrix factorization and deep neural network models can generate embedding vectors for users and items. However, conventional models still rely on user-item interaction data for training and cannot generalize well to cold-start users and items. There are also models designed to generate embedding vectors for cold-start users and items, such as DropoutNet, which is described in detail below.
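To make the vector-matching idea concrete, here is a minimal sketch (pure Python, with hypothetical embedding values) that compares a cold-start item's embedding against user interest vectors by cosine similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: a cold-start item vector produced from its
# content features, and two user interest vectors.
new_item = [0.9, 0.1, 0.3]
user_a = [0.8, 0.2, 0.4]    # close to the new item -> good candidate
user_b = [-0.5, 0.9, -0.1]  # far from the new item

score_a = cosine_similarity(new_item, user_a)
score_b = cosine_similarity(new_item, user_b)
assert score_a > score_b  # recommend the new item to user_a first
```

In production this nearest-neighbor search is usually done with an approximate vector index rather than a pairwise loop.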

This push-up (generalization) method sounds simple and easy to understand, but there is still a lot of room to dig deeper. In essence, it uses the content (attribute) information of an item to compensate for the item's lack of historical interactions. For example, multi-modal information such as pictures and videos can be used to make relevant recommendations. On a dating platform, for instance, a new user (here, the recommended item) can be scored on photo attractiveness and then recommended to users with matching appearance preferences (here, the users browsing the recommendation list).

Fast: in the world of martial arts, speed is unbeatable. Cold-start items are, by definition, items lacking historical user interactions, so a natural idea is to collect interactions on new items faster and feed them into the recommender system sooner. Conventional recommendation models and data are updated in units of days, while real-time processing systems can update data and models within minutes or even seconds. Such methods are usually based on reinforcement learning or contextual bandit algorithms. Two reference articles are "Implementation and Application of Contextual Bandit Algorithms in Recommender Systems" and "Experience and Pitfalls of Deploying Contextual Bandit Algorithms in Production Recommender Systems"; we will not go into details here.

Transfer: transfer learning builds models by reusing data from different scenarios, transferring knowledge from a source domain to a target domain. For example, a new business with only a small number of samples needs to be modeled with data from other scenarios; the other scenarios are then the source domain, and the new business scenario is the target domain. As another example, some cross-border e-commerce platforms run sites in different countries, and some newly opened sites have very little user interaction data. The interaction data of mature sites in other countries can be used to train a model, which is then fine-tuned with the small number of samples from the new site, achieving a good cold-start effect. When applying transfer learning, note that the source domain and target domain need to be correlated to some degree; for example, the products sold by sites in different countries may largely overlap.

Few: few-shot learning techniques, as the name suggests, train models using only a small amount of supervised data. A typical few-shot learning method is meta learning. Since the purpose of this article is not to introduce these techniques, we will not elaborate; interested readers can refer to "Cold Start Recommendation Models Based on Meta-Learning".

This article mainly introduces a method based on "generalization". Specifically, we describe in detail an embedding learning model that works in fully cold-start scenarios: DropoutNet. The original DropoutNet model requires user and item embedding vectors as input supervision signals; these embeddings usually come from other models, such as matrix factorization, which raises the bar for using the model. This article proposes an end-to-end training method that directly uses user interaction behavior as the training target, greatly lowering the bar for using the model.

In addition, to make the model learn more efficiently, this article adds two new loss functions on top of the pointwise loss of a conventional binary classification model: a rank loss that focuses on improving the AUC metric, and a Support Vector Guided Softmax Loss for improving recall. The latter innovatively adopts a negative sampling technique called "negative mining", which automatically samples negative items from the current mini-batch during training, thereby expanding the sample space and achieving a better learning effect.

Therefore, the contributions of this article are mainly the following two points:

1. We modify the original DropoutNet model to directly use user-item interaction data as the training target for end-to-end training, avoiding the need for another model to provide user and item embeddings as supervision signals.
2. We innovatively propose a multi-task learning framework that combines several types of loss functions and uses a negative-mining sampling technique: during training, negative samples are drawn from the current mini-batch, which expands the sample space, makes learning more efficient, and suits scenarios with relatively little training data.

DropoutNet Model Analysis

The NIPS 2017 paper "DropoutNet: Addressing Cold Start in Recommender Systems" introduces a recall model that works for head users and items, mid- and long-tail users and items, and even brand-new users and items.

DropoutNet is a typical two-tower structure: the user tower learns the latent-space vector representation of the user, and correspondingly, the item tower learns the latent-space vector representation of the item. When a user has some interaction with an item, such as a click or purchase, the loss function drives the user's vector representation and the item's vector representation to be as close as possible; when a user has no interaction with an item, the corresponding user-item pair constitutes a negative sample, and the model tries to push the user's vector representation in that sample as far as possible from the item's.

To make the model suitable for any stage of a recommender system, so that it can learn vector representations not only of head users and items but also of mid- and long-tail, or even brand-new, users and items, DropoutNet divides user and item features into two parts: content features and preference statistics features. Content features are relatively stable and do not change often; the corresponding information is usually collected when the user registers or the item goes online. Preference statistics features, on the other hand, are computed from interaction logs; they are dynamic and change over time. Brand-new users and items have no preference statistics because they have no interactions.

So how does DropoutNet make the model suitable for learning vector representations of brand-new users and items? The idea is very simple: borrowing the idea of dropout from deep learning, some input features are forcibly set to 0 with a certain probability, the so-called input dropout. Note that the dropout here does not act on the neurons of the neural network but directly on the input nodes. Specifically, the preference statistics features of users and items are set to 0 with a certain probability during training, while the content features are never dropped out.
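A minimal sketch of this input-dropout idea (pure Python with hypothetical feature values; the function name and per-sample zeroing are illustrative, not DropoutNet's exact implementation):

```python
import random

def input_dropout(content_feats, pref_feats, p_drop=0.5, training=True):
    """With probability p_drop, zero out the entire preference-statistics
    block for a sample, mimicking a brand-new user/item.
    Content features are always kept."""
    if training and random.random() < p_drop:
        pref_feats = [0.0] * len(pref_feats)
    return content_feats + pref_feats  # concatenated model input

# p_drop=1.0 always zeroes the preference block, as for a cold-start sample.
x = input_dropout([1.0, 2.0], [0.7, 0.3, 0.9], p_drop=1.0)
assert x == [1.0, 2.0, 0.0, 0.0, 0.0]
```

At inference time for a truly new user or item, the preference block is genuinely all zeros, so the network has already learned to produce reasonable representations from content features alone.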

According to the paper, DropoutNet borrows the idea of the denoising autoencoder: the model is trained to accept a corrupted input and reconstruct the original input, so that it can still produce an accurate vector representation when some input features are missing. Specifically, the model makes the relevance score between the user vector and the item vector learned from the corrupted input as close as possible to the score learned from the uncorrupted input.

The objective function is:

O = Σ_{u,v} ( U_u V_vᵀ − f_U(U_u, Φ_u^U) f_V(V_v, Φ_v^V)ᵀ )² = Σ_{u,v} ( U_u V_vᵀ − Û_u V̂_vᵀ )²

where Û_u = f_U(U_u, Φ_u^U) is the user vector representation learned by the model, and V̂_v = f_V(V_v, Φ_v^V) is the item vector representation learned by the model; U_u and V_v are the externally provided user and item vector representations used as supervision signals, generally learned by other models.

In order to make the model suitable for the user's cold start scenario, dropout is performed on the user's preference statistics during the training process:
user cold start: O_uv = ( U_u V_vᵀ − f_U(0, Φ_u^U) f_V(V_v, Φ_v^V)ᵀ )²

In order to make the model suitable for the item cold start scenario, dropout is performed on the item's preference statistics during training:

item cold start: O_uv = ( U_u V_vᵀ − f_U(U_u, Φ_u^U) f_V(0, Φ_v^V)ᵀ )²

The DropoutNet training procedure is given as Algorithm 1 in the original paper.

End-to-end training transformation

A limitation of the DropoutNet model is that it requires user and item embedding vectors as supervision signals. The model masks part of the input features through dropout and tries to learn a vector representation that can reconstruct the similarity between the user and item embeddings from the remaining input features; the principle is similar to a denoising autoencoder. This means we need another model to learn the user and item embeddings first. From the perspective of the whole pipeline, the learning goal must be completed in two stages: the first stage trains a model to obtain the user and item embeddings; the second stage trains the DropoutNet model to obtain a more robust vector representation that also applies to new, cold-start users and items.

To simplify the training process, we propose an end-to-end training method. Under the new method, it is no longer necessary to provide user and item embedding vectors as supervision signals; instead, we use the user's interaction behavior with items as the supervision signal. For example, similar to a click-through-rate model, if a user clicks an item, the user and the item constitute a positive sample; items displayed to the user but not clicked constitute negative samples. Through the design of the loss function, the model learns to make the similarity between the user's and the item's vector representations as high as possible for positive samples and as low as possible for negative samples. For example, the following loss function can be used:
L = −[ y log( Û_u V̂_{v+}ᵀ ) + (1 − y) log( 1 − Û_u V̂_{v−}ᵀ ) ]

where y ∈ {0, 1} is the target the model fits, v+ denotes an item that user u interacted with, and v− denotes an item that user u did not interact with.
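A minimal numerical sketch of this pointwise loss (pure Python, hypothetical 2-d embeddings; here we additionally squash the dot product with a sigmoid so the score lies in (0, 1), an assumption not spelled out in the formula above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pointwise_loss(user_vec, item_vec, y):
    """Binary cross-entropy on the (sigmoid of the) dot product of the
    user and item embeddings: y=1 for clicked, y=0 for shown-not-clicked."""
    score = sigmoid(sum(u * v for u, v in zip(user_vec, item_vec)))
    return -(y * math.log(score) + (1 - y) * math.log(1 - score))

u = [0.5, 1.0]
pos_item = [0.6, 0.8]    # clicked: high dot product -> low loss for y=1
neg_item = [-0.6, -0.8]  # not clicked: low dot product -> low loss for y=0
assert pointwise_loss(u, pos_item, 1) < pointwise_loss(u, pos_item, 0)
assert pointwise_loss(u, neg_item, 0) < pointwise_loss(u, neg_item, 1)
```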

Online Negative Sampling & Loss Function

For a model in the recall phase of a recommender system, using only exposure logs to construct training samples is not enough, because users are shown only a small number of items, and most items on the platform may never be exposed to the current user. If these unexposed items never form samples with the current user, the model can explore only a small part of the potential sample space, which weakens its generalization performance.

Negative sampling is a commonly used technique for recall models and is key to ensuring the model's effect. There are many negative sampling methods; see Facebook's paper "Embedding-based Retrieval in Facebook Search", which we will not repeat here. The following discusses negative sampling only from the implementation perspective.
There are usually two approaches, as shown in the following table.
| Negative sampling method | Advantage | Shortcoming |
| --- | --- | --- |
| Offline negative sampling | Simple to implement | Limited sample space; slow training |
| Online negative sampling | Dynamically expands the sample space during training; faster training | More complex to implement |
There are also different implementations of online negative sampling. For example, a global shared memory can be used to maintain the set of items to sample from. A disadvantage of this method is that it is complicated to implement: normally we collect and aggregate multiple days of user behavior logs to construct samples, and the total volume is too large to fit entirely in memory. Moreover, when the same item appears in samples from multiple days, its statistical features differ across days, so feature time-travel (leakage) problems can easily occur with careless handling.
Another, niftier implementation is to sample from the current mini-batch. Because the training data is globally shuffled before being used to train the model, the sample set in each mini-batch is a random sample of the whole; sampling negatives from the mini-batch is therefore theoretically equivalent to negative sampling over the global sample set. This method is relatively simple to implement, and it is the online sampling method adopted in this article.
Specifically, during training, after the user and item features pass through the forward stage of the network, we obtain the user embeddings and item embeddings. Next, we perform a row-wise roll on the item-embedding matrix: move its rows down by N rows as a whole, and re-insert the N rows that fall off the bottom into the top N rows of the matrix, which is equivalent to moving N steps in one direction in a circular queue. In this way, negative user-item pairs are obtained, and repeating the operation M times yields M sets of negative pairs.
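The row-wise roll can be illustrated with a small sketch (a pure-Python analogue of `tf.roll` along axis 0, with hypothetical user/item ids):

```python
def roll(rows, shift):
    """Circularly shift a list of rows down by `shift` positions
    (the same idea as tf.roll along axis 0)."""
    shift %= len(rows)
    return rows[-shift:] + rows[:-shift] if shift else list(rows)

# Hypothetical in-batch item embeddings aligned with users u0..u3.
users = ['u0', 'u1', 'u2', 'u3']
items = ['i0', 'i1', 'i2', 'i3']

# After rolling by 1, each user is paired with another user's item, which
# (with high probability) it never interacted with -> a negative pair.
neg_items = roll(items, 1)
assert neg_items == ['i3', 'i0', 'i1', 'i2']
negative_pairs = list(zip(users, neg_items))
assert negative_pairs[0] == ('u0', 'i3')
```

Rolling with a different shift each time yields additional independent sets of negative pairs from the same mini-batch.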

The transformed DropoutNet network is shown in the figure above. First, compute the cosine similarity between the user semantic vector and the positive item, denoted R(u, i+); then compute the cosine similarities between the user semantic vector and the N negative items, denoted R(u, i1−), ..., R(u, iN−). A softmax transformation over the N+1 similarity scores gives the user's preference probabilities over the items; the final loss is the negative logarithm of the user's preference probability for the positive item, as follows:
L = −log P(i+|u) = −log [ exp(R(u, i+)) / ( exp(R(u, i+)) + Σ_{j∈Neg} exp(R(u, i_j−)) ) ]

Further, we borrow the idea of the paper "Support Vector Guided Softmax Loss for Face Recognition" and introduce the maximum-margin and support-vector practices into our softmax loss implementation: negative samples that the model gets "wrong" (those violating the margin) are given extra weight, forcing the model to tackle harder tasks during training, which makes it more robust and more likely to make correct judgments in the prediction phase.

The TensorFlow implementation of the support vector guided softmax loss with negative mining is as follows:
import tensorflow as tf

# Note: `get_shape_list` is a shape helper provided by the surrounding
# framework; it returns the dimensions of a tensor as a list.


def softmax_loss_with_negative_mining(user_emb,
                                      item_emb,
                                      labels,
                                      num_negative_samples=4,
                                      embed_normed=False,
                                      weights=1.0,
                                      gamma=1.0,
                                      margin=0,
                                      t=1):
  """Compute the softmax loss based on the cosine distances explained below.

  Given mini batches for `user_emb` and `item_emb`, this function computes for
  each element in `user_emb` the cosine distance between it and the
  corresponding `item_emb`, and additionally the cosine distances between
  `user_emb` and some other elements of `item_emb` (referred to as negative
  samples). The negative samples are formed on the fly by shifting the right
  side (`item_emb`). Then the softmax loss is computed based on these cosine
  distances.

  Args:
    user_emb: a `Tensor` with shape [batch_size, embedding_size]; the user embeddings.
    item_emb: a `Tensor` with shape [batch_size, embedding_size]; the item embeddings.
    labels: a `Tensor` with shape [batch_size], e.g. click or not click in the
      session; its values must be 0 or 1.
    num_negative_samples: the number of negative samples, must be in range [1, batch_size).
    embed_normed: bool, whether the input embeddings are already l2-normalized.
    weights: acts as a coefficient for the loss. If a scalar is provided, the
      loss is simply scaled by the given value. If `weights` is a tensor of
      shape [batch_size], the loss weights apply to each corresponding sample.
    gamma: smooth coefficient of the softmax.
    margin: the margin between positive pairs and negative pairs.
    t: coefficient of the support vector guided softmax loss.

  Returns:
    support vector guided softmax loss of the positive labels.
  """
  batch_size = get_shape_list(item_emb)[0]
  assert 0 < num_negative_samples < batch_size, \
      '`num_negative_samples` should be in range [1, batch_size)'
  if not embed_normed:
    user_emb = tf.nn.l2_normalize(user_emb, axis=-1)
    item_emb = tf.nn.l2_normalize(item_emb, axis=-1)
  vectors = [item_emb]
  for i in range(num_negative_samples):
    shift = tf.random_uniform([], 1, batch_size, dtype=tf.int32)
    neg_item_emb = tf.roll(item_emb, shift, axis=0)
    vectors.append(neg_item_emb)
  # all_embeddings's shape: (batch_size, num_negative_samples + 1, vec_dim)
  all_embeddings = tf.stack(vectors, axis=1)
  mask = tf.greater(labels, 0)
  mask_user_emb = tf.boolean_mask(user_emb, mask)
  mask_item_emb = tf.boolean_mask(all_embeddings, mask)
  if isinstance(weights, tf.Tensor):
    weights = tf.boolean_mask(weights, mask)
  # sim_scores's shape: (num_of_pos_label_in_batch_size, num_negative_samples + 1)
  sim_scores = tf.keras.backend.batch_dot(
      mask_user_emb, mask_item_emb, axes=(1, 2))
  pos_score = tf.slice(sim_scores, [0, 0], [-1, 1])
  neg_scores = tf.slice(sim_scores, [0, 1], [-1, -1])
  loss = support_vector_guided_softmax_loss(
      pos_score, neg_scores, margin=margin, t=t, smooth=gamma, weights=weights)
  return loss


def support_vector_guided_softmax_loss(pos_score,
                                       neg_scores,
                                       margin=0,
                                       t=1,
                                       smooth=1.0,
                                       threshold=0,
                                       weights=1.0):
  """Refer paper: Support Vector Guided Softmax Loss for Face Recognition."""
  new_pos_score = pos_score - margin
  cond = tf.greater_equal(new_pos_score - neg_scores, threshold)
  mask = tf.where(cond, tf.zeros_like(cond, tf.float32),
                  tf.ones_like(cond, tf.float32))  # I_k, marks the support vectors
  new_neg_scores = mask * (neg_scores * t + t - 1) + (1 - mask) * neg_scores
  logits = tf.concat([new_pos_score, new_neg_scores], axis=1)
  if 1.0 != smooth:
    logits *= smooth
  # the positive item is always at column 0 of `logits`
  loss = tf.losses.sparse_softmax_cross_entropy(
      tf.zeros_like(pos_score[:, 0], dtype=tf.int32), logits, weights=weights)
  # set rank loss to zero if a batch has no positive sample.
  loss = tf.where(tf.is_nan(loss), tf.zeros_like(loss), loss)
  return loss
source code:
Pairwise Ranking
Pointwise, pairwise, and listwise are three well-known optimization objectives in the field of LTR (Learning to Rank). Even before the deep learning era, IR researchers had developed a series of basic methods; for classic work, refer to "Learning to Rank Using Gradient Descent" and "Learning to Rank: From Pairwise Approach to Listwise Approach".

The significance of pairwise is to unify the model's training objective with its actual task as much as possible. For a ranking task, the real goal is to score positive samples higher than negative samples, corresponding to metrics such as AUC. In the classic pairwise paper on RankNet, the optimization objective is written as:
C_ij = −y_ij log P_ij − (1 − y_ij) log(1 − P_ij),  where P_ij = e^{f(x_i) − f(x_j)} / (1 + e^{f(x_i) − f(x_j)})
Here P_ij is the probability predicted by the model that sample i is more "relevant" than sample j, and f(x_i) − f(x_j) is the difference between the pointwise output logits of the two samples. Intuitively, optimizing C_ij raises the probability that the model scores any positive sample higher than any negative sample, i.e., the AUC, so this form of pairwise loss is also called AUC loss.
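A quick numerical check of the P_ij formula above (pure Python, with hypothetical logit values; note that P_ij is simply the sigmoid of the logit difference):

```python
import math

def rank_prob(s_i, s_j):
    """RankNet P_ij = e^{s_i - s_j} / (1 + e^{s_i - s_j}),
    i.e. the sigmoid of the logit difference s_i - s_j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

assert rank_prob(2.0, 1.0) > 0.5            # higher-scored sample ranks first
assert abs(rank_prob(1.0, 1.0) - 0.5) < 1e-9  # equal scores -> 50/50
```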
Similarly, to simplify implementation and avoid constructing pair samples offline, we choose the in-batch random pairing method, constructing pairs from the mini-batch during training to compute the pairwise rank loss. The specific implementation code is as follows:
import logging

import tensorflow as tf


def pairwise_loss(labels, logits):
  # logit difference of every (i, j) pair: logits[i] - logits[j]
  pairwise_logits = tf.expand_dims(logits, -1) - tf.expand_dims(logits, 0)
  logging.info('[pairwise_loss] pairwise logits: {}'.format(pairwise_logits))
  # keep only pairs where label[i] > label[j], i.e. positive vs. negative
  pairwise_mask = tf.greater(
      tf.expand_dims(labels, -1) - tf.expand_dims(labels, 0), 0)
  logging.info('[pairwise_loss] mask: {}'.format(pairwise_mask))
  pairwise_logits = tf.boolean_mask(pairwise_logits, pairwise_mask)
  logging.info('[pairwise_loss] after masking: {}'.format(pairwise_logits))
  pairwise_pseudo_labels = tf.ones_like(pairwise_logits)
  loss = tf.losses.sigmoid_cross_entropy(pairwise_pseudo_labels,
                                         pairwise_logits)
  # set rank loss to zero if a batch has no positive sample.
  loss = tf.where(tf.is_nan(loss), tf.zeros_like(loss), loss)
  return loss
Source code:
Model implementation open source code

We have released the source code of DropoutNet in EasyRec, the open source recommendation algorithm framework of the Alibaba Cloud Machine Learning PAI team. Please check the usage documentation.

EasyRec is an easy-to-use recommendation algorithm model training framework. It has many state-of-the-art built-in recommendation models, covering the recall, ranking, and cold-start phases of recommender systems. It can run on multiple platforms such as local machines, DLC, MaxCompute, and DataScience, and supports loading training and evaluation data in various formats (text, csv, table, tfrecord) from various storage media (local, hdfs, MaxCompute tables, oss, kafka). EasyRec supports many types of features, loss functions, optimizers, and evaluation metrics, and supports massively parallel training. With EasyRec, you only need to write a config file and can perform training, evaluation, export, and inference through command calls without writing code, helping you quickly build search and recommendation algorithms.
