Core Technology Disclosure of the Search Model

1. Background and Significance

User modeling is a core technology of search and recommendation models. The object that Taobao search ranking scores is a (user, query, item) triplet. From the perspective of sample feature expression, the item is the relatively dense and stable part: with a large number of samples, most of its information can be expressed by an id embedding. The user, by contrast, is the relatively sparse part of the three, so describing the user requires a large number of generalization features.

From the perspective of model classification, static features of users and products enhance the generalization of the model, while introducing and modeling real-time user behavior greatly sharpens the distinction between samples and significantly improves classification accuracy. We regard user modeling as the process of abstracting and organizing information about users.

In terms of information abstraction, we continue to optimize and enrich the modeling methods:

User profiles, which represent the user's static attribute information;
Preference tag mining, which predicts users' general preferences from their behavior;
Real-time behavior modeling, which characterizes interest under the current request at a finer granularity.

In terms of information processing, we organize user behavior data along two dimensions, behavior cycle and behavior content:

By behavior cycle, we divide the behavior sequence into short-term and long-term sequences, and use different time spans to describe interests at different granularities;
By behavior content, products with direct behavioral feedback and products that were only exposed express user intent explicitly and implicitly, respectively. We also extend user behavior data from traditional e-commerce products to some pan-content information.

User behavior modeling is also a research direction for many teams in recommendation and advertising, which have produced very good work such as DIN, DIEN, and Practice on Long Sequential User Behavior Modeling from the advertising team, and MIND and DSIN from the recommendation team. Starting from practical applications, we share some practical experience of user modeling in search scenarios, and we also record the phenomena and problems we have observed. Discussion is welcome. The rest of the article is organized as follows:

A brief introduction to the overall structure of the large model;
Organization and processing of user data and behavior sequences;
Improvements to the model structure;
Model experiments and analysis, with discussion of related practical issues.

2. Model structure

Our model structure is roughly as shown in the figure. User profiles, multiple user behavior sequence features, the features of the product to be scored, and some other real-time context features (weather, network, time, etc.) are concatenated and fed into a DNN classifier. User behavior sequence modeling is one of the most important parts of the model, and it is especially important for depicting user interests in real time. Here we use self-attention (self atten) and attention pooling (atten pooling) for sequence modeling: self atten describes the relationships between behaviors, while atten pooling matches and activates behaviors and combines them. This is one of our general-purpose sequence modeling components. Based on this framework, the following sections introduce our optimization work this year and its implementation details.
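The concat-then-classify structure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the feature dimensions, the number of sequence vectors, the layer sizes, the ReLU activation, and the name dnn_classifier are all hypothetical and are not taken from the production model.

```python
import numpy as np

def dnn_classifier(x, layers):
    """Tiny MLP: ReLU hidden layers followed by a sigmoid output (assumed design)."""
    h = x
    for w, b in layers[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = layers[-1]
    logit = h @ w + b
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(3)

# Hypothetical feature groups, each already encoded as a dense vector.
user_profile = rng.normal(size=8)
seq_vectors = [rng.normal(size=8) for _ in range(3)]  # one vector per behavior sequence
item_features = rng.normal(size=8)
context_features = rng.normal(size=4)                 # weather, network, time, ...

# Final concat, then the DNN classifier.
x = np.concatenate([user_profile, *seq_vectors, item_features, context_features])

dims = [x.shape[0], 32, 16, 1]                        # hypothetical layer sizes
layers = [(rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]
score = dnn_classifier(x, layers)                     # predicted probability, shape (1,)
```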

3. User data and models

The user profile consists only of relatively static user features, which serve as a supplement to and generalization of user_id. Recently, models that capture users' real-time interests, represented by session-based recommendation, have been widely studied, and a large number of experiments have shown that they can significantly improve recommendation accuracy. In particular, the more recent a behavior is, the better it represents the user's current interest state. Representing users by the products they have acted on helps us capture the dynamics of user interests in real time. In addition, from a graph perspective, in a user-item graph the item is a dense node and the user is a sparse node, and dense nodes are well suited to expressing sparse nodes.

We define a unified behavior schema as shown below:

The schema consists of two parts: the attribute features of the product itself, and the features of the user's behavior on the product. Attribute features include item_id, seller_id, and a series of other features that describe the product. Behavior features include the behavior type, behavior time, sequence position, and other characteristics of the user's behavior on the product.
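One way to picture the two-part schema is as a pair of small record types. The sketch below is an assumption for illustration: only item_id, seller_id, behavior type, behavior time, and sequence position come from the description above, while category_id, the class names, and the concrete value encodings are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ItemAttributes:
    """Attribute features of the product itself."""
    item_id: int
    seller_id: int
    category_id: int  # hypothetical example of a further descriptive feature

@dataclass(frozen=True)
class BehaviorContext:
    """Features of the user's behavior on the product."""
    behavior_type: str  # e.g. "click" or "purchase" (assumed encoding)
    behavior_time: int  # Unix timestamp in seconds (assumed encoding)
    position: int       # position in the sequence, 0 = most recent (assumed)

@dataclass(frozen=True)
class BehaviorRecord:
    item: ItemAttributes
    context: BehaviorContext

record = BehaviorRecord(
    item=ItemAttributes(item_id=1001, seller_id=77, category_id=5),
    context=BehaviorContext(behavior_type="click",
                            behavior_time=1_600_000_000,
                            position=0),
)
```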

3.1 Short and medium cycle

Data and features: In the Starship project last year, we built a global model over users' short- and medium-term behaviors of different types. The sequence includes users' real-time clicks, add-to-cart actions, favorites, purchases, and other behaviors across the entire network. Because users have rich historical behavior and our sequence length has an upper limit (L_max=50), we use the predicted category of the query to select the historical behaviors most relevant to the current intent category. That the user actively enters a query is the biggest difference between search and recommendation. The query understanding team has done a lot of work on queries; here we use their results, filtering user behavior sequences by the query's predicted leaf categories.
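The category filter with a length cap can be sketched as follows. The field name category_id, the function name, the most-recent-first ordering, and the choice to keep the most recent matches when truncating are assumptions for illustration; only the cap L_max=50 and the idea of filtering by the query's predicted categories come from the text.

```python
L_MAX = 50  # upper limit on sequence length, as stated above

def filter_by_predicted_categories(behaviors, predicted_categories, l_max=L_MAX):
    """Keep behaviors whose category is among the query's predicted leaf categories.

    `behaviors` is assumed to be ordered from most recent to oldest, so the
    truncation keeps the most recent matching behaviors.
    """
    allowed = set(predicted_categories)
    kept = [b for b in behaviors if b["category_id"] in allowed]
    return kept[:l_max]

# Toy history: 200 behaviors spread over three hypothetical categories.
history = [{"item_id": i, "category_id": i % 3} for i in range(200)]
filtered = filter_by_predicted_categories(history, predicted_categories=[0, 2])
```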

Model structure:

The Transformer model has achieved very good results in NLP. It uses the self-attention mechanism to capture dependencies within a sequence, and it can be computed in parallel, which speeds up training and prediction. In the CTR model, we use self atten extensively to process item sequences. We make the following two improvements on top of the existing self atten.

First, we use cosine + scale up instead of the original dot product + scale down (layer norm can also be added) to compute the similarity between K and Q, which improves the discrimination of the softmax logits. When applying the original self atten, we found two problems: (1) the softmax logits are very small, so the attention weights stay close to uniform; (2) gradients vanish in the attention structure, which makes it almost impossible to learn. Our CTR model embeddings are initialized from 0 rather than randomly, and the small gradients make self atten hard to learn. We therefore use the cosine value instead of the dot product to measure the similarity between K and Q, as shown below:

Second, we apply a query_atten_pooling to the output of self atten. Although the short- and medium-term sequence is already filtered to be consistent with the category of the current intent, to further ensure consistency with the query we apply query_atten_pooling to the sequence, activating historical behaviors that match the intent and semantics of the current query. The experiments section further analyzes why the usual target attention is not used here.
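The difference between the two similarity functions can be illustrated with a minimal NumPy sketch. It assumes a single head with no learned projections, and the scale factor of 10 is a hypothetical value; the text does not state the scale actually used. The toy input uses near-zero embeddings to mimic the early-training situation described above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cosine_self_attention(q, k, v, scale=10.0, eps=1e-8):
    """Cosine similarity between Q and K, scaled up before the softmax."""
    q_n = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k_n = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = scale * (q_n @ k_n.T)  # logits lie in [-scale, scale] regardless of norm
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

def dot_product_attention_weights(q, k):
    """Original formulation: dot product scaled down by sqrt(d)."""
    d = q.shape[-1]
    return softmax((q @ k.T) / np.sqrt(d), axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(scale=1e-3, size=(5, 16))  # near-zero embeddings, as early in training

out, cos_w = cosine_self_attention(x, x, x)
dot_w = dot_product_attention_weights(x, x)
# dot_w is almost exactly uniform (every weight close to 1/5), while cos_w is
# clearly peaked, because cosine similarity ignores the tiny embedding norms.
```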

The overall structure for short- and medium-term behavior modeling, self atten + query atten pooling, is shown in the figure below:

Here, Context Masked Embedding is computed as e1+c1, that is, the item-level embedding plus the context-level embedding.
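The e1+c1 step and the query attention pooling can be sketched together as below. This is a simplified illustration: the self atten layer that sits between the two steps is omitted, the scaled dot product used for scoring is an assumption (the text does not specify the scoring function), and the padding mask and all dimensions are hypothetical.

```python
import numpy as np

def query_atten_pooling(query_vec, seq, valid_mask):
    """Weight each behavior by its match with the query, then sum.

    query_vec: (d,), seq: (L, d), valid_mask: (L,) booleans marking real
    (non-padding) positions.
    """
    logits = seq @ query_vec / np.sqrt(seq.shape[-1])
    logits = np.where(valid_mask, logits, -1e9)  # padding gets ~zero weight
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ seq, weights

rng = np.random.default_rng(1)
item_emb = rng.normal(size=(4, 8))  # e_i: item-level embeddings
ctx_emb = rng.normal(size=(4, 8))   # c_i: context-level embeddings
masked = item_emb + ctx_emb         # Context Masked Embedding, e_i + c_i

query_vec = rng.normal(size=8)
valid = np.array([True, True, True, False])  # last position is padding
pooled, weights = query_atten_pooling(query_vec, masked, valid)
```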

3.2 Long cycle

Data and features: The short- and medium-term behavior sequence covers users' recent behaviors across the entire network within a limited length, so a user's earlier behavior is most likely not in the sequence. In the Starship project last year, we supplemented this information with static features of users' long-term preferences computed offline, but that approach cannot select relevant preference information according to the user's current intent and context.

We want to model users' long-term behavior end to end. First, this introduces users' long-term preferences. Second, long-term behavior can be modeled against the current intent instead of being reduced to statically extracted preference labels. Third, it brings in behavior data from the same season last year, which addresses the personalization cold start at a change of season.

We define a user's long-term behavior as the user's transaction behavior over the last two years; other types, such as clicks, are not considered for now. One reason is that users tend to remember what they bought last year but not what they clicked on. The other is that the volume of data for other behavior types is too large.

Specifically, we divide the two years of behavior data into 8 quarterly sequences, each a subsequence of length N, so that the behavior of every quarter is preserved. Truncating a single combined sequence could filter out all the behavior of some quarter, so the user's multi-faceted information would not be retained as fully as possible. We also do not apply category-related filtering to these sequences: compared with short-term behavior, long-term behavior focuses more on preferences for brands, stores, terms, and so on.
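The quarterly split can be sketched with plain Python. The use of calendar quarters, the index convention (0 = current quarter), the per-quarter cap of 3, and the function names are assumptions for illustration; the text only states that there are 8 quarterly sequences of length N.

```python
from datetime import date

def quarters_ago(event_day, today):
    """Number of calendar quarters between the event and today (0 = current quarter)."""
    q_now = today.year * 4 + (today.month - 1) // 3
    q_evt = event_day.year * 4 + (event_day.month - 1) // 3
    return q_now - q_evt

def split_into_quarters(purchases, today, num_quarters=8, n_per_quarter=3):
    """Bucket (day, item_id) purchases by quarter, keeping the newest N in each.

    Capping each quarter separately means a busy recent quarter cannot push
    an older quarter out of the sequence entirely.
    """
    buckets = [[] for _ in range(num_quarters)]
    for day, item_id in sorted(purchases, reverse=True):  # newest first
        idx = quarters_ago(day, today)
        if 0 <= idx < num_quarters and len(buckets[idx]) < n_per_quarter:
            buckets[idx].append(item_id)
    return buckets

today = date(2020, 11, 11)
purchases = [
    (date(2020, 11, 1), 1), (date(2020, 10, 5), 2), (date(2020, 10, 2), 3),
    (date(2020, 10, 1), 4),    # fourth purchase this quarter, dropped by the cap
    (date(2019, 11, 11), 5),   # same season last year, lands in bucket 4
    (date(2018, 6, 1), 6),     # older than two years, ignored
]
buckets = split_into_quarters(purchases, today)
```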

Model structure:

As shown in the figure below, we first apply add_mask to the item_level embedding of the long-term behavior with the context_level embedding, that is, e1+c1. We then use a multi-layer approach: within each quarterly sequence, the optimized self atten extracts features and a sub-sequence atten_pooling produces the user preference expression for that quarter. Finally, the expressions of the different quarters are concatenated (we also tried atten_pooling across quarters, which was slightly worse than concat) to obtain the user's final long-term preference expression. The advantage is that for season-sensitive search intents, preference information that better matches the current quarter can be extracted.

The atten_pooling here uses shortseq_vec as the query. shortseq_vec is an expression that combines user intent with the user's real-time preferences, so it can evaluate the importance of a historical trigger to the current search more comprehensively than the original search terms, which express only the search intent. In addition, the short sequence is itself a sequence representation, so the two vector spaces are relatively close.
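The pool-per-quarter-then-concat step can be sketched as follows. The within-quarter self atten is omitted for brevity, the scaled dot-product scoring is an assumption, and returning a zero vector for a quarter with no purchases is a hypothetical choice that the text does not address.

```python
import numpy as np

def atten_pool(query_vec, seq):
    """Attention pooling of seq (L, d) against query_vec (d,)."""
    if seq.shape[0] == 0:
        return np.zeros_like(query_vec)  # assumed handling of an empty quarter
    logits = seq @ query_vec / np.sqrt(seq.shape[-1])
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return w @ seq

def long_term_vector(quarter_seqs, shortseq_vec):
    """Pool each quarter with shortseq_vec as the query, then concat the quarters."""
    pooled = [atten_pool(shortseq_vec, seq) for seq in quarter_seqs]
    return np.concatenate(pooled)

rng = np.random.default_rng(2)
d = 8
shortseq_vec = rng.normal(size=d)  # stands in for the short-sequence output
quarter_seqs = [rng.normal(size=(n, d)) for n in (3, 0, 2, 5, 1, 0, 4, 2)]
long_vec = long_term_vector(quarter_seqs, shortseq_vec)  # shape (8 * d,)
```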

3.3 On-device clicks

The first two parts capture the products users acted on, and the corresponding context, over the long term and the short term respectively. But actual user behavior goes far beyond clicking and purchasing: on the search results page, product lists are exposed and users browse and scroll; on the detail page, users watch videos, compare parameters, and so on. Users have very rich behaviors that we had not been observing. Some of these features cannot be obtained from server logs, and others arrive through ut logs with a large delay, which makes them difficult to use online. We therefore use on-device computing to collect these features on the client and transmit them to the server, which also reduces rt to within milliseconds. The next two parts introduce the modeling of the on-device click sequence and the on-device exposure sequence.

Data and features: Modeling the on-device click sequence differs from modeling the click behavior in the short- and medium-term sequence in several ways. One is more detailed user behavior, such as the user's dwell time in each block of the detail page and which buttons were clicked. Another is that the sequence is not filtered by query relevance: we hope to bring in useful information from cross-category behavior as a further supplement to the short- and medium-term sequence rather than a repetition of it. At the instance level, each behavior has a certain importance; at the feature level, the importance of each behavior also depends on how well features such as brand, store, and term match the target product, not only on whether the behavior is consistent with the query's predicted category.

Model structure: The on-device click sequence is modeled in a similar way to the short- and medium-term sequence, capturing the user's preference expression through the optimized self atten + atten pooling. Likewise, the detail-page behavior we collect can be used as context information and applied to the item_level embedding through add_mask. The atten_pooling here also uses shortseq_vec as the query.

3.4 On-device exposure

Earlier we introduced the modeling of user preferences, which basically starts from users' active behavior and uses the products users like to describe and understand an unfamiliar user. Can we also describe users starting from what they do not like? To this end, we collect the products that were exposed to users but not clicked, model them, and integrate them into our large model. Unlike other product sequences, exposed products have a very important property: they are the most accurate reflection of what can actually be retrieved under the current user + query + context. To some extent, exposed products synthesize the rich information of the search and can express the user and the query effectively.

Data and features: Compared with users' clicks and purchases, the volume of pv exposure data is very large, so we do not write it into igraph in real time and read it back. Instead, it is collected and stored on the device and sent directly with each search request, which bypasses the storage problem and keeps the data real-time. Specifically, when a user searches for "dress", the client uploads to the server the information of the 50 products that were exposed but not clicked during the user's recent searches for "dress". For example, when ranking the second page, the server receives the exposed products from the first page.

Model structure: The exposure sequence is modeled relatively simply: the exposure expression is obtained through mean pooling. To distinguish the positive expression from the exposure expression, we first compute d1, the distance between the positive expression and the target product, and d2, the distance between the exposure expression and the target product, and then add an auxiliary loss so that the two distances satisfy the following conditions: when the target product is a positive sample, d1 + M < d2; when the target product is a negative sample, d1 > d2 + M. The modeling method is similar to that shown in the figure below.
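One common way to turn these two margin conditions into a differentiable penalty is a hinge loss. The sketch below assumes that form; the text states only the constraints, not the exact loss, and the margin value of 0.1 and the function name are hypothetical.

```python
def auxiliary_margin_loss(d1, d2, is_positive, margin=0.1):
    """Hinge-style penalty for violating the distance constraints.

    d1: distance between the positive expression and the target product.
    d2: distance between the exposure expression and the target product.
    Positive sample: want d1 + M < d2, so penalize d1 + M - d2 when positive.
    Negative sample: want d1 > d2 + M, so penalize d2 + M - d1 when positive.
    """
    if is_positive:
        return max(0.0, d1 + margin - d2)
    return max(0.0, d2 + margin - d1)

# Constraint satisfied, no penalty: the target is close to the positive
# expression and far from the exposure expression.
loss_ok = auxiliary_margin_loss(d1=0.2, d2=0.9, is_positive=True)

# Constraint violated: both distances are equal, so the penalty equals the margin.
loss_violated = auxiliary_margin_loss(d1=0.5, d2=0.5, is_positive=True)
```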

4. Experiment and Analysis

Data set: We take online exposure and click data as samples. The samples from the previous N days are used for training, which runs until convergence, and the samples from the following day are used for testing.

Evaluation indicators: With the following day's samples as the test set, we mainly evaluate AUC on the test set, and also observe AUC on the training set and other indicators. The AUC gap reported here is the AUC improvement over the full online model.
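For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pair counting, which is quadratic and only suitable for small illustrative inputs; it is not the evaluation code used for the experiments.

```python
def auc(labels, scores):
    """AUC by pair counting: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four (positive, negative) pairs are ordered correctly.
example_auc = auc(labels=[1, 0, 1, 0], scores=[0.9, 0.1, 0.4, 0.6])
```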

Effect comparison: We run comparative experiments against several baselines, where LSTM with Q/U/P Attention is DUPN (KDD '18). The results are shown in the table below:

Effect analysis: The AUC gap here is the improvement over the converged online version. Optimizing the behavior sequence model brings an AUC increase of about 0.3%; introducing new sequence features targeted at different problems, together with the corresponding modeling optimizations, brings an AUC increase of about 0.7%.

4.1 Short- and medium-term sequences

Attention weight analysis: We plot the attention map in TensorBoard, where cell (i, j) represents the attention weight between the i-th product and the j-th product. We observe the following phenomena:

As shown in the figure, the lighter the color of a cell, the greater the weight. At the beginning of training, self atten does not yet capture the internal correlations of the product sequence; it only learns some behavior information, namely that products at the front of the sequence (those with the most recent behavior) are important. As training progresses, the correlation between products gradually emerges, which shows up directly as larger weights near the diagonal. The final result is an attention map shaped by both product relevance and behavior information. In addition, for the atten pooling layer, we monitor the weight at each sequence position. Overall, the weights decrease with position from the most recent to the oldest, which is in line with expectations.

The role of self atten: We tried removing the self atten layer and applying atten pooling directly to the sequence, and AUC consistently dropped by 0.001 in absolute terms. We understand self atten from two aspects: 1. In the general interpretation, it models the dependencies within the product sequence, making product expressions more accurate; 2. During self atten, the products of the user's most recent behavior, as the closest representation of user intent, act as Keys that attend to the entire sequence, and the resulting vectors account for the largest share of information in the final atten pooling.

Why not target atten in search: Target atten is commonly used in recommendation scenarios. We also tried it in search but did not adopt it in the end, for two reasons: 1. In search scenarios, users actively enter queries, which already express the user's current intent (such as category intent) strongly. Adding the target item on top of that yields a much smaller gain, and AUC improves only slightly. In addition, our user sequence is already pre-filtered by the industry/category predicted from the query. 2. In the final fine-ranking stage, target atten must compute atten pooling once for each doc, which adds extra computation, while query atten only needs to be computed once and therefore performs better. Although its gain is small in the search scenario, target atten remains the most effective approach in the recommendation scenario.
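The cost argument in the second reason can be made concrete with a small sketch. The sequence length of 50 matches L_max; the number of candidate docs (200), the embedding size, and the scaled dot-product scoring are hypothetical values chosen only to show how the number of pooling calls grows.

```python
import numpy as np

def atten_pool(query_vec, seq):
    """Attention pooling of seq (L, d) against query_vec (d,)."""
    logits = seq @ query_vec / np.sqrt(seq.shape[-1])
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return w @ seq

rng = np.random.default_rng(4)
seq = rng.normal(size=(50, 8))    # one user's behavior sequence
query_vec = rng.normal(size=8)    # representation of the search query
docs = rng.normal(size=(200, 8))  # candidate docs in the fine-ranking stage

# Query atten: the pooled user vector does not depend on the candidate,
# so it is computed once and shared by every doc in the request.
user_vec_query = atten_pool(query_vec, seq)
query_calls = 1

# Target atten: the pooled user vector depends on the candidate,
# so one pooling call is needed per doc.
user_vecs_target = np.stack([atten_pool(doc, seq) for doc in docs])
target_calls = len(docs)
```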

4.2 Other sequence modeling

Long-period sequence experiments:

To analyze whether the long cycle can serve as a personalized cold start from the same season last year, we ran a simple experiment: we joined the 8 quarterly sequences of the long cycle into a single sequence and used self atten to observe the relatedness between quarters. As shown in the figure, the horizontal axis is seq_index, and the lighter the color, the greater the weight. We see a phenomenon similar to that in the short- and medium-term behavior sequence: the final attention map is shaped by both product association and behavior information. The first column, the front position, is the most important; the fifth column, which corresponds to the same season as the current quarter, is the second most important; and the diagonal represents each position's attention to itself.

On-device click sequence experiment:

Similarly, we can visualize the attention map of the on-device click sequence. As shown in the figure, at the beginning of training, importance is driven mainly by the behavior context, and the closer a position is to the front, the more important it is. The final result differs slightly from the short- and medium-term sequence: the most important region is the diagonal and its surroundings. The diagonal is easy to understand; the area around it is also relatively light, which suggests an obvious correlation between products clicked one after another.

Exposed products: The AUC results show that the model using exposed products as features improves significantly, by +0.002 in absolute AUC. As products that were not clicked, they express to some extent what users do not like. At the same time, as products that can be recalled under the current query, they are also a fuller expression of the query.

5. Summary and Outlook

We have described in detail the user modeling part of the Taobao search CTR/CVR model. We perceive users comprehensively, describe them with richer user profiles and with more comprehensive and more real-time behavior data, understand them more deeply, and abstract their intent with more refined and reasonable models. In the Double Eleven business of the main search, the user modeling work was fully launched as part of the Optimus Prime project. Compared with the algorithm benchmark bucket, it directly brought a significant improvement in overall search GMV. Looking ahead, we still have a lot of work to do: we need to explore more detailed perception of user data, more scientific organization of user data, and more appropriate model structures.
