In-depth analysis and improvement of cold start recommendation model DropoutNet

Why do you need a cold start
Usually, recommendation systems generate recommendation candidate sets through collaborative filtering, matrix decomposition, or deep learning models. These recall algorithms generally rely on user-item behavior matrices. In a real recommendation system, there will be a steady stream of new users and new items added. Due to the lack of sufficient and rich historical interactive behavior data, these newly added users and items often cannot obtain accurate recommendation content or be accurately recommended to suitable user. This is the so-called recommendation cold-start problem. Cold start is a challenge for recommendation systems. The reason is that the existing recommendation algorithms, whether it is recall, rough sorting or fine sorting modules, are not friendly to new users and new items. They often rely too much on the system to collect user behavior data, while the behavior data of new users and new items is very little. As a result, the opportunities for new items to be displayed are relatively small; the interests of new users cannot be accurately modeled.
For some businesses, it is very important for the ecological construction and long-term benefits of the platform to recommend new items in a timely manner so that new items can gain sufficient exposure. For example, the timeliness of news information is very strong, if the opportunity to display it cannot be obtained in time, its news value will be greatly reduced; if the self-media UGC platform cannot obtain a sufficient number of newly released content in a timely manner, it will affect the enthusiasm of content creators , thereby affecting the amount of high-quality content that the platform can accommodate in the future; if the dating platform cannot attract new users enough attention, then there may not be a steady stream of new users joining, thus making the platform lose activity .
In summary, the cold start problem is very important in the recommendation system, so how to solve the cold start problem?
How to Fix the Cold Start Problem
The algorithm (or strategy) to solve the cold start problem of the recommendation system is summarized as: "pan, fast, move, less" four-character formula.
Pan: that is, to generalize new items, relying on a broader concept in terms of attributes or themes. For example, if a product is newly launched, it can be recommended to users who liked the same category in the past, that is, from "product" to "category"; "Short Video" is pushed up to "Author"; a newly released piece of news information can be recommended to users who like the same topic, such as recommending an article introducing "J-20" to a military fan, that is, from "News" Push to "Theme". Essentially, this is a content-based recommendation (Content Based Recommendation). Of course, for a better recommendation effect, we sometimes need to push up multiple different "superior concepts" at the same time. For example, in addition to pushing up new products to "category", we can also push up to "brand", "store", " Style", "Color" and so on. The concept of push-up is sometimes inherent in new items. This situation is relatively simple. For example, the various attributes of the product are generally filled in by the merchant when the product is released; Topic, this article belongs to "military", "sports", "beauty makeup" and other topics that need to be mined by another algorithm.
In addition to generalization on tags or topics, it is also a very common method to use some algorithm to obtain the embedding vectors of users and items, and then use the distance/similarity of the vectors to match the interests of users and items. Algorithms such as matrix decomposition and deep neural network model can generate embedding vectors of users and items. However, conventional models still need to rely on the interaction behavior data of users and items to model, and cannot generalize well to cold-start users and items. item. There are also some models that can be used to generate embedding vectors for cold start users and items, such as DropoutNet, which will be described in detail below.
Although the method of pushing up or generalizing sounds simple and easy to understand, there is still a lot of work that can be done to dig deeper. Essentially, this is using the item's content (properties) information to compensate for the lack of historical interaction behavior for this new item. For example, multimodal information of items, such as pictures and videos, can be used to make relevant recommendations. For example, on a dating platform, you can score a new user’s photo appearance (referred to here as a recommended item), and then recommend it to users with related appearance preferences (here refers to users browsing the recommendation list).
Fast: The only martial art in the world is fast. The so-called cold start items, that is, items that lack historical user interaction behavior, then a natural idea is to collect the interaction behavior of new items faster and use them in the recommendation system. Conventional recommendation algorithm models and data are updated in units of days, and based on real-time processing systems, data and models can be updated in minutes or even seconds. Such methods are usually based on reinforcement learning/contextual bandit algorithms. Here are two reference articles, so I won’t go into details: "Implementation and Application of Contextual Bandit Algorithm in Recommender System", "Experience and Pitfalls of Deploying Contextual Bandit Algorithm in Recommender System in Production Environment".
Transfer: Transfer learning is a method of building a model by calling data from different scenarios. Knowledge can be transferred from the source domain to the target domain through transfer learning. For example, a new business has only a small number of samples, and it needs to be modeled with data from other scenarios. At this time, other scenarios are the source domain, and the new business scenario is the target domain. For another example, some cross-border e-commerce platforms have different sites in different countries, and some sites are newly opened with only a small amount of user interaction behavior data. At this time, the interaction behavior data of other more mature sites in other countries can be used To train the model, and use a small number of samples from the current national site to do fine-tune, it can also have a good cold start effect. When using transfer learning technology, it should be noted that the source domain and the target domain need to be related to a certain extent. For example, a large part of the products sold by sites in different countries just mentioned overlap.
Few: Few-shot learning techniques, as the name implies, are techniques that use only a small amount of supervised data to train a model. Among them, the typical few-shot learning method is meta learning. Since the purpose of this article is not to introduce these learning techniques, I will not introduce too much. Interested students can refer to: "Cold Start Recommendation Model Based on Meta-Learning".
This article mainly introduces a method based on "generalization". Specifically, we will introduce in detail an embedding learning model that can be applied to a completely cold start scenario: DropoutNet. The original DropoutNet model needs to provide the embedding vectors of users and items as input supervision signals. These embedding vectors usually come from other algorithm models, such as matrix decomposition, etc., which increases the threshold for using the model. This paper proposes an end-to-end training method, which directly uses user interaction behavior as the training target, which greatly reduces the threshold for using the model.
In addition, in order to make the learning of the model more efficient, this paper adds two new loss functions based on the pointwise loss function of the conventional binary classification prediction model: one is the rank loss that focuses on improving the AUC index; the other is the Is the Support Vector Guided Softmax Loss used to improve the recall effect. The latter innovatively adopts a negative sampling technique called "Negative Mining". During the training process, negative sample items are automatically sampled from the current mini batch, thereby expanding the sample space and achieving better learning effects. .
Therefore, there are two main contributions of this paper, which are summarized as follows:
1. This paper modifies the original DropoutNet model, and directly uses the interaction behavior data between users and items as the training target for end-to-end training, thus avoiding the need to use other models to provide user and item embeddings as supervisory signals.
2. The text innovatively proposes a multi-task learning framework using various types of loss functions, and uses Negative Mining’s negative sampling technology during the training process to sample negative samples from the current mini batch during the training process. The sample space is expanded to make learning more efficient, and it is also suitable for scenarios with a relatively small amount of training data.
DropoutNet model analysis
The NIPS 2017 article "DropoutNet: Addressing Cold Start in Recommender Systems" introduced a recall model that is suitable for both head users and items, as well as mid- and long-tail, and even brand-new users and items.
DropoutNet is a typical dual structure, the user tower is used to learn the latent space vector representation of the user; correspondingly, the item tower is used to learn the latent space vector representation of the item. When the user has some kind of interactive behavior on the current item, such as clicking and purchasing, the loss function design of the model sets the distance between the vector representation of the user and the vector representation of the item as close as possible; When the item generates any interaction behavior, the corresponding user and item pair constitutes a negative sample, and the model will try to make the distance between the vector representation of the user and the vector representation of the item in the corresponding sample as far as possible.
In order to make the model applicable to any stage of the recommendation system, it can be used not only to learn the vector representation of head users and items, but also to learn the vector representation of middle and long tail, or even brand new users and items. DropoutNet combines the user and item Features are divided into two parts: content features, preference statistical features. The content characteristics are relatively stable and will not change frequently, and the corresponding information is generally collected when the user registers or the item goes online. On the other hand, the preference statistical feature is based on the feature obtained from the interactive row log statistics, which is dynamic and will change with time. Brand new users and items do not have preference statistics since they have no corresponding interactions.
So how does DropoutNet make the model suitable for learning completely new item and user vector representations? In fact, the idea is very simple. It borrows the idea of dropout in deep learning, and forcibly sets some features of the input to 0 according to a certain probability, which is the so-called input dropout. Note that the dropout here does not act on the neurons of the neural network model, but directly acts on the input node. Specifically, the preference statistical features of users and items have a certain probability to be set to 0 during the learning process, while the features of the content dimension will not be dropped out.
According to the introduction of the paper, DropoutNet draws on the idea of denoising autoencoder (denoising autoencoder), that is, the training model accepts the corrupted input to reconstruct the original input, that is, learns a model so that it can be used in the absence of some input features. A more accurate vector representation can still be obtained. Specifically, the model is to make the correlation score between the user vector and the item vector learned when the input is corrupted be as close as possible to the user whose input is not corrupted. The correlation score between the vector and the item vector.
The objective function is:
O=∑u,v(UuVvT−fU(Uu,ΦuU)fV(Vv,ΦvV)T)2=∑u,v(UuVvT−U^uV^vT)2
Among them, U^u is the user vector representation learned by the model, V^v is the item vector representation learned by the model; **Uu and Vv are the externally input user and item vector representations as supervisory signals, generally through Other models learn to get**.
In order to make the model suitable for the user's cold start scenario, the user's preference statistical features are dropped out during the training process:
user cold start: Ouv=(UuVvT−fU(0,ΦuU)fV(Vv,ΦvV)T)2
In order to make the model suitable for the item cold start scenario, the user's preference statistical features are dropped out during the training process:
item cold start: Ouv=(UuVvT−fU(Uu,ΦuU)fV(0,ΦvV)T)2
The learning process of the DrouputNet model is shown in Algorithm 1:
End-to-End Training Transformation
One of the drawbacks of the DropoutNet model is that it is necessary to provide embedding vectors of users and items as supervisory signals. The model masks part of the input features through dropout, and tries to learn a vector representation that can reconstruct the similarity between the user and the item embedding vector through part of the input features. The principle is similar to the denoising automatic encoding machine. This means that we need another model to learn the embedding vectors of users and items. From the perspective of the whole process, we need to complete the learning goals in two stages. The first stage trains a model to obtain the embedding vectors of users and items. The second stage Train the DropoutNet model to obtain a more robust vector representation, and can be applied to brand-new cold-start users and items.
In order to simplify the training process, we propose an end-to-end training method. In the new training method, it is no longer necessary to provide embedding vectors for and items as supervisory signals. Instead, we use the interaction between users and items as Supervision signal. For example, similar to the click-through rate prediction model, if a user clicks on an item, the user and the item constitute a positive sample; those items that are presented to the user but are not clicked constitute a negative sample. Through the design of the loss function, the model can learn that the similarity between the user and the item vector representation of the positive sample is as high as possible, and the similarity between the user and the item vector representation of the negative sample is as low as possible. For example, the following loss function can be used:
L=−[ylog(U^uV^v+T)+(1−y)log(1−U^uV^v−T)]
Among them, y∈{0,1} is the target of model fitting; v+ represents items that interact with user u; v− represents items that do not interact with user u.
Online Negative Sampling & Loss Functions
As a model in the recall phase of a recommendation system, it is not enough to use exposure logs to construct training samples, because usually users can only be shown a small number of items, and most items on the platform may never be exposed to current users. If these unexposed items are not constructed as samples with the current user, the model can only explore a small part of the potential sample space, making the generalization performance of the model weak.
Negative sampling of samples is a commonly used technique for recall models, and it is also the key to ensure the effect of the model. There are many methods of negative sampling, you can refer to Facebook's paper "Embedding-based Retrieval in Facebook Search”, so I won’t go into details here. The following will only talk about how to do sample negative sampling from the perspective of implementation.
There are usually two approaches to sample negative sampling, as shown in the following table.
negative sampling method
advantage
shortcoming
Offline Negative Sampling
easy to implement
Limited sample space and slow training
Online Negative Sampling
Dynamically expand the sample space during training, and the training is faster
more complicated to implement
Online sample negative sampling also has different implementations. For example, a global shared memory can be used to maintain the set of items to be sampled. A disadvantage of this method is that it is more complicated to implement. Under normal circumstances, we collect and aggregate user behavior logs for multiple days to construct samples. The total amount of samples is too large to fit into memory. When the same item appears in multiple days of samples, the corresponding statistical characteristics are also different, and the problem of feature crossing may occur if it is not handled properly.
Another more ingenious implementation is to sample from the current mini-batch. Because the training data needs to be globally shuffled (shuffle) and then used to train the model, the sample set in each mini-batch is randomly sampled. When we sample negative samples from the mini-batch, it is theoretically equivalent to The global sample is negatively sampled. This method is relatively simple to implement, and this article uses this online sampling method.
Specifically, during the training process, after the user and item features perform the forward phase of the network, the user embedding and item embedding are obtained. Next, we perform a row-by-row offset operation (row- wise roll), move the rows of the matrix (corresponding to the item embedding) down by N rows as a whole, and then re-insert the N rows that were removed from the matrix into the first N rows of the matrix, which is equivalent to moving in one direction in a circular queue Take N steps. In this way, a negative sample user item pair is obtained, and M negative sample pairs are obtained by repeating the above operation M times.
The modified DropoutNet network is shown in the figure above. First, calculate the cosine similarity between the user semantic vector and the positive sample items, which is denoted as R(u,i+); then calculate the cosine similarity between the user semantic vector and N negative sample items and denote it as R(u,i1−), ⋯,R(u,iN−), perform softmax transformation on the N+1 similarity scores to obtain the user's preference probability for the item; the final loss function is the negative logarithm of the user's preference probability for the positive sample item, as follows:
L=−log(P(i+|u))=−log(exp(R(u,i+))exp(R(u,i+))+∑j∈Negexp(R(u,ij−)))
Furthermore, we refer to the idea of the paper "Support Vector Guided Softmax Loss for Face Recognition", and introduce the method of maximum interval and support vector in the process of implementing the softmax loss function. By using "weaken correct" and "enlarge In the wrong way, the model is forced to challenge more difficult tasks during training, making the model more robust and making it easy to make correct judgments in the prediction stage.
The tensorflow implementation code of support vector guided softmax loss based on negative sampling is as follows:
def softmax_loss_with_negative_mining(user_emb,
item_emb,
labels,
num_negative_samples=4,
embed_normed=False,
weights=1.0,
gamma=1.0,
margin=0,
t=1):
"""Compute the softmax loss based on the cosine distance explained below.
Given mini batches for `user_emb` and `item_emb`, this function computes for each element in `user_emb`
the cosine distance between it and the corresponding `item_emb`,
and additionally the cosine distance between `user_emb` and some other elements of `item_emb`
(referred to a negative sample).
The negative samples are formed on the fly by shifting the right side (`item_emb`).
Then the softmax loss will be computed based on these cosine distances.
Args:
user_emb: A `Tensor` with shape [batch_size, embedding_size]. The embedding of user.
item_emb: A `Tensor` with shape [batch_size, embedding_size]. The embedding of item.
labels: a `Tensor` with shape [batch_size]. e.g. click or not click in the session. It's values must be 0 or 1.
num_negative_samples: the num of negative samples, should be in range [1, batch_size).
embed_normed: bool, whether input embeddings l2 normalized
weights: `weights` acts as a coefficient for the loss. If a scalar is provided,
then the loss is simply scaled by the given value. If `weights` is a
tensor of shape `[batch_size]`, then the loss weights apply to each corresponding sample.
gamma: smooth coefficient of softmax
margin: the margin between positive pair and negative pair
t: coefficient of support vector guided softmax loss
return:
support vector guided softmax loss of positive labels
"""
batch_size = get_shape_list(item_emb)[0]
assert 0 < num_negative_samples < batch_size, '`num_negative_samples` should be in range [1, batch_size)'
if not embed_normed:
user_emb = tf.nn.l2_normalize(user_emb, axis=-1)
item_emb = tf.nn.l2_normalize(item_emb, axis=-1)
vectors = [item_emb]
for i in range(num_negative_samples):
shift = tf.random_uniform([], 1, batch_size, dtype=tf.int32)
neg_item_emb = tf.roll(item_emb, shift, axis=0)
vectors.append(neg_item_emb)
# all_embeddings's shape: (batch_size, num_negative_samples + 1, vec_dim)
all_embeddings = tf.stack(vectors, axis=1)
mask = tf. greater(labels, 0)
mask_user_emb = tf. boolean_mask(user_emb, mask)
mask_item_emb = tf. boolean_mask(all_embeddings, mask)
if isinstance(weights, tf. Tensor):
weights = tf. boolean_mask(weights, mask)
# sim_scores's shape: (num_of_pos_label_in_batch_size, num_negative_samples + 1)
sim_scores = tf.keras.backend.batch_dot(
mask_user_emb, mask_item_emb, axes=(1, 2))
pos_score = tf. slice(sim_scores, [0, 0], [-1, 1])
neg_scores = tf. slice(sim_scores, [0, 1], [-1, -1])
loss = support_vector_guided_softmax_loss(
pos_score, neg_scores, margin=margin, t=t, smooth=gamma, weights=weights)
return loss
def support_vector_guided_softmax_loss(pos_score,
neg_scores,
margin=0,
t=1,
smooth=1.0,
threshold=0,
weights=1.0):
"""Refer paper: Support Vector Guided Softmax Loss for Face Recognition (https://128.84.21.199/abs/1812.11317)."""
new_pos_score = pos_score - margin
cond = tf. greater_equal(new_pos_score - neg_scores, threshold)
mask = tf.where(cond, tf.zeros_like(cond, tf.float32),tf.ones_like(cond, tf.float32)) # I_k
new_neg_scores = mask * (neg_scores * t + t - 1) + (1 - mask) * neg_scores
logits = tf.concat([new_pos_score, new_neg_scores], axis=1)
if 1.0 != smooth:
logits *= smooth
loss = tf.losses.sparse_softmax_cross_entropy(
tf.zeros_like(pos_score, dtype=tf.int32), logits, weights=weights)
# set rank loss to zero if a batch has no positive sample.
loss = tf.where(tf.is_nan(loss), tf.zeros_like(loss), loss)
return loss
Source code: https://github.com/alibaba/EasyRec/blob/master/easy_rec/python/loss/softmax_loss_with_negative_mining.py
Pairwise Ranking
Pointwise, pairwise and listwise are three well-known optimization objectives in the field of LTR (Learning to Rank). Long before the era of deep learning, researchers doing IR have developed a series of basic methods. For more classic work, please refer to "Learning to Rank" to Rank Using Gradient Descent" and "Learning to Rank- From Pairwise Approach to Listwise Approach".
The significance of Pairwise is to make the training goal of the model and the actual task of the model as consistent as possible. For a sorting task, the real goal is to make the predicted score of positive samples higher than that of negative samples, corresponding to indicators such as AUC. In pairwise's classic paper RankNet, the pairwise optimization objective is written as,
Cij=−yijlogPij−(1−yij)log(1−Pij)Pij=ef(xi)−f(xj)1+ef(xi)−f(xj)
Here Pij represents the probability that the model predicts that sample i is more "correlated" than j, where f(xi)−f(xj) is the difference between the pointwise output logit of the two sample models. Intuitively, optimizing Cij is to improve the probability of the model that any positive sample score is higher than any negative sample score, that is, AUC, so this form of pairwise loss is also called AUC loss.
Similarly, in order to facilitate implementation and reduce the workload of constructing pair samples offline, we chose the In-batch Random Pairing method to calculate the pairwise rank loss by constructing pairs from the mini batch during the training process. The specific implementation code is as follows:
def pairwise_loss(labels, logits):
pairwise_logits = tf.expand_dims(logits, -1) - tf.expand_dims(logits, 0)
logging.info('[pairwise_loss] pairwise logits: {}'.format(pairwise_logits))
pairwise_mask = tf. greater(
tf.expand_dims(labels, -1) - tf.expand_dims(labels, 0), 0)
logging.info('[pairwise_loss] mask: {}'.format(pairwise_mask))
pairwise_logits = tf.boolean_mask(pairwise_logits, pairwise_mask)
logging.info('[pairwise_loss] after masking: {}'.format(pairwise_logits))
pairwise_pseudo_labels = tf.ones_like(pairwise_logits)
loss = tf.losses.sigmoid_cross_entropy(pairwise_pseudo_labels,
pairwise_logits)
# set rank loss to zero if a batch has no positive sample.
loss = tf.where(tf.is_nan(loss), tf.zeros_like(loss), loss)
return los

Related Articles

Explore More Special Offers

  1. Short Message Service(SMS) & Mail Service

    50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00

phone Contact Us