How to establish a user behavior model framework
▌Research background
A person is defined by the behaviors they exhibit. Accurate, in-depth research on users sits at the core of many business problems. In the long run, as more and more types of user behavior can be recorded, platforms need to integrate these behaviors to understand users better and provide better personalized services.
For Alibaba, global marketing centered on consumer operations is a data- and technology-driven solution that combines users' behavior data from across the whole ecosystem to help brands achieve new forms of marketing. Research on user behavior has therefore become a core problem, and its biggest challenge is whether users' heterogeneous behavior data can be processed at a finer granularity.
Against this background, this paper proposes a general user representation framework that integrates different types of user behavior sequences, and verifies its effectiveness on a recommendation task. In addition, we expect this user representation to serve different downstream tasks through multi-task learning.
▌Related work
Heterogeneous behavior modeling: user characteristics are typically built through manual feature engineering, mostly aggregate statistics or unordered ID feature sets.
Single-behavior sequence modeling: user sequence modeling usually uses RNNs (LSTM/GRU) or CNN + pooling. RNNs are hard to parallelize, so training and prediction are slow, and the internal memory of an LSTM cannot retain specific behavior records. CNNs likewise fail to preserve individual behavioral features and need deeper stacks to capture interactions between arbitrary pairs of behaviors.
Heterogeneous data representation learning: related work includes knowledge graph and multi-modal representation learning, but it usually relies on explicit mapping supervision. In our task there is no such explicit mapping between heterogeneous behaviors.
The main contributions of this paper are as follows:
We design and implement a method that integrates multiple types of time-series behavior data of a user. The key idea is a solution that handles heterogeneous behaviors and their temporal ordering at the same time, with a relatively simple implementation.
We use the self-attention mechanism from Google's Transformer to remove the limitations of CNNs and LSTMs, so that training and prediction are faster while accuracy improves slightly.
The framework is easily extensible: more types of behavioral data can be plugged in, and multi-task learning can be used to compensate for behavioral sparsity.
▌Introduction to the ATRank approach
The entire user representation framework consists of a raw feature layer, a semantic mapping layer, a Self-Attention layer, and a target network. The semantic mapping layer lets different behaviors be compared and interact within different semantic spaces. The Self-Attention layer turns each individual behavior into a representation that takes the influence of the user's other behaviors into account. The target network uses vanilla attention to accurately find the user behaviors relevant to the prediction task. Our experiments show that time encoding + self-attention can indeed replace CNN/RNN for describing sequence information, which makes model training and prediction faster.
1. Behavior grouping
A user's behavior sequence can be described by a triplet (action type, target, time). We first group user behaviors by the target entity, as shown at the bottom of the framework figure with different color groups: for example product behaviors, coupon behaviors, keyword behaviors, and so on. The action type can be click, favorite, add-to-cart, receive, use, etc.
Each entity has its own attributes, including real-valued features and discrete ID-like features. The action type is also an ID-like feature, and we discretize time as well. The three parts are summed to obtain the vector group of the next layer.
That is, the encoding of a behavior = custom target encoding + lookup(discretized time) + lookup(action type).
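To make this encoding rule concrete, here is a minimal PyTorch sketch, assuming simple embedding lookups; the class name, vocabulary sizes, and dimensions are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: encoding one behavior group as
#   behavior vector = target encoding + lookup(time bucket) + lookup(action type)
import torch
import torch.nn as nn

class BehaviorGroupEncoder(nn.Module):
    """Encodes one behavior group, e.g. product behaviors (illustrative sizes)."""
    def __init__(self, num_targets, num_time_buckets, num_action_types, dim):
        super().__init__()
        self.target_emb = nn.Embedding(num_targets, dim)       # custom target encoding
        self.time_emb = nn.Embedding(num_time_buckets, dim)    # discretized time lookup
        self.action_emb = nn.Embedding(num_action_types, dim)  # action type lookup

    def forward(self, target_ids, time_buckets, action_ids):
        # All inputs: [batch, seq_len] integer ids; the three parts are summed.
        return (self.target_emb(target_ids)
                + self.time_emb(time_buckets)
                + self.action_emb(action_ids))

# Usage with toy ids; in practice the target encoding may itself combine several
# id and real-valued features, some with shared lookup tables.
enc = BehaviorGroupEncoder(num_targets=1000, num_time_buckets=32, num_action_types=8, dim=16)
vecs = enc(torch.randint(0, 1000, (2, 5)),
           torch.randint(0, 32, (2, 5)),
           torch.randint(0, 8, (2, 5)))
print(vecs.shape)  # torch.Size([2, 5, 16])
```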
Since different entities carry different amounts of information, the encoded vector lengths differ across behavior groups, which reflects that the behaviors themselves carry different amounts of information. In addition, some parameters can be shared between different behaviors, such as the lookup tables of features like store ID and category ID, which reduces sparsity and the total number of parameters.
The main purpose of grouping is not only easier explanation but also implementation: variable-length, heterogeneous inputs are hard to process efficiently without grouping. As we will see later, our approach does not actually enforce a chronological ordering of behaviors.
2. Semantic Space Mapping
This layer enables heterogeneous behaviors to communicate on shared semantics by linearly projecting them into multiple semantic spaces. For example, in the framework diagram the space to be expressed is the atomic semantic space composed of red, green, and blue (RGB), and the composite colors below (different types of user behaviors) are projected into each atomic semantic space. Within the same semantic space, the components of these heterogeneous behaviors that share the same semantics become comparable.
Similar ideas appear in knowledge graph representation learning. In NLP, recent studies have shown that attention over multiple semantic spaces improves results. One explanation, in my view, is that without dividing into multiple semantic spaces a kind of semantic neutralization occurs: two different behaviors a and b may be related only in certain aspects, but when the attention score is a single global scalar, a and b amplify each other's influence in weakly related aspects while their influence in the strongly related aspects is diluted.
Although from an implementation point of view this layer simply maps all behavior encodings into one unified space (the mapping can be linear or non-linear), for the subsequent network layers we can view this large space as being partitioned into multiple semantic subspaces, with self-attention performed within each subspace. For ease of interpretation we therefore describe the mapping directly as projections onto multiple semantic subspaces.
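As a rough illustration of this view, the sketch below (hypothetical names and sizes, not the paper's code) projects each behavior group into one unified space and then reads that space as K subspaces.

```python
# Hypothetical sketch: project grouped behavior encodings into one unified space,
# then view the unified space as K semantic subspaces (dimensions are illustrative).
import torch
import torch.nn as nn

class SemanticProjection(nn.Module):
    def __init__(self, group_dims, unified_dim, num_spaces):
        super().__init__()
        assert unified_dim % num_spaces == 0
        self.num_spaces = num_spaces
        # One linear map per behavior group, since group vector lengths differ.
        self.proj = nn.ModuleList([nn.Linear(d, unified_dim) for d in group_dims])

    def forward(self, group_seqs):
        # group_seqs: list of [batch, len_g, dim_g] tensors, one per behavior group.
        unified = torch.cat([p(s) for p, s in zip(self.proj, group_seqs)], dim=1)
        # Split the last dimension into K subspaces: [batch, total_len, K, dim/K].
        b, n, d = unified.shape
        return unified.view(b, n, self.num_spaces, d // self.num_spaces)

# Example: two groups (product dim 16, coupon dim 8) mapped into a 32-dim space,
# viewed as 4 semantic subspaces of 8 dimensions each.
proj = SemanticProjection([16, 8], unified_dim=32, num_spaces=4)
out = proj([torch.randn(2, 5, 16), torch.randn(2, 3, 8)])
print(out.shape)  # torch.Size([2, 8, 4, 8])
```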
3. Self-Attention layer
The purpose of the Self-Attention layer is to transform each user behavior from an objective representation into a representation in the user's memory. Objective representation means that if users A and B have done the same thing, the representation of that behavior itself may be identical; but the intensity and clarity of this behavior in A's and B's memories can be completely different, because their other behaviors differ. Observing the softmax function, the more similar behaviors a user has performed, the more their representations are averaged toward each other, while behaviors that bring a distinct experience are more likely to retain their own information. Self-attention therefore simulates how the representation of one behavior is affected by the user's other behaviors.
In addition, self-attention can be stacked in multiple layers: one layer corresponds to first-order behavioral influence, and multiple layers capture higher-order influences. This network structure draws on Google's self-attention (Transformer) framework.
The specific calculation method is as follows:
Let S denote the output obtained by concatenating the entire semantic layer, and S^k its projection onto the k-th semantic space; the representation of the k-th semantic space after self-attention is then given by the formula below.
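The original equation image is not reproduced in this article; the following is a plausible reconstruction consistent with the surrounding description (bilinear attention per space, concatenation of the space vectors, then a feed-forward network), not the paper's exact notation.

```latex
% Reconstruction sketch:
%   S   \in R^{n x d}   : concatenated semantic-layer output for all n behaviors
%   S^k \in R^{n x d_k} : its projection onto the k-th semantic space
%   W^k \in R^{d_k x d} : a learned bilinear matrix
\hat{S}^{k} = \mathrm{softmax}\!\left(S^{k} W^{k} S^{\top}\right) S,
\qquad
\hat{S} = \mathrm{FFN}\!\left(\big[\hat{S}^{1};\, \hat{S}^{2};\, \dots;\, \hat{S}^{K}\big]\right)
```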
The attention function here can be regarded as a bilinear attention function. The final output concatenates these per-space vectors and feeds them through a feed-forward network, as in the sketch below.
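A minimal PyTorch sketch of this per-space bilinear self-attention followed by a feed-forward network; the shapes, initialization, and FFN size are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical sketch of multi-space bilinear self-attention + feed-forward network.
import torch
import torch.nn as nn

class MultiSpaceSelfAttention(nn.Module):
    def __init__(self, d_model, num_spaces):
        super().__init__()
        d_k = d_model // num_spaces
        self.num_spaces, self.d_k = num_spaces, d_k
        # One bilinear matrix W^k per semantic space: scores between S^k and the full S.
        self.W = nn.Parameter(torch.randn(num_spaces, d_k, d_model) * 0.01)
        self.ffn = nn.Sequential(nn.Linear(num_spaces * d_model, d_model), nn.ReLU())

    def forward(self, S):
        # S: [batch, n, d_model]; S_k: [batch, n, K, d_k]
        b, n, d = S.shape
        S_k = S.view(b, n, self.num_spaces, self.d_k)
        outs = []
        for k in range(self.num_spaces):
            # Bilinear scores: [batch, n, n]
            scores = torch.einsum('bnd,de,bme->bnm', S_k[:, :, k, :], self.W[k], S)
            outs.append(torch.softmax(scores, dim=-1) @ S)   # [batch, n, d_model]
        # Concatenate the per-space vectors, then apply a feed-forward network.
        return self.ffn(torch.cat(outs, dim=-1))             # [batch, n, d_model]

attn = MultiSpaceSelfAttention(d_model=32, num_spaces=4)
print(attn(torch.randn(2, 8, 32)).shape)  # torch.Size([2, 8, 32])
```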
4. Target network
The target network is customized for different downstream tasks. The tasks in this article are user behavior prediction and click prediction in a recommendation scenario, trained and evaluated point-wise.
The gray bars in the framework diagram represent the candidate behavior to be predicted. This behavior is also transformed through embedding and projection, and then attends over the behavior vectors produced by the user representation via vanilla attention. Finally, the attention vector and the target vector are fed into a ranking network, where features strongly related to the specific scenario can also be added. The ranking network can be arbitrary, e.g., Wide & Deep, DeepFM, or PNN; in the paper we use a simple DNN.
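The sketch below illustrates this step under simplifying assumptions: dot-product vanilla attention between the target vector and the user's behavior representations, followed by a small DNN ranker. All names and sizes are hypothetical, not the paper's code.

```python
# Hypothetical target-network sketch: vanilla attention + point-wise DNN ranker.
import torch
import torch.nn as nn

class TargetNetwork(nn.Module):
    def __init__(self, d_model, hidden=64):
        super().__init__()
        self.ranker = nn.Sequential(
            nn.Linear(2 * d_model, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, user_behaviors, target_vec):
        # user_behaviors: [batch, n, d]; target_vec: [batch, d]
        scores = torch.bmm(user_behaviors, target_vec.unsqueeze(-1)).squeeze(-1)   # [batch, n]
        attn = torch.softmax(scores, dim=-1)
        context = torch.bmm(attn.unsqueeze(1), user_behaviors).squeeze(1)          # [batch, d]
        # Point-wise probability from the attention vector + target vector.
        return torch.sigmoid(self.ranker(torch.cat([context, target_vec], dim=-1)))

net = TargetNetwork(d_model=32)
print(net(torch.randn(2, 8, 32), torch.randn(2, 32)).shape)  # torch.Size([2, 1])
```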
▌Offline experiment
To evaluate the framework on single-behavior prediction, we experiment on the public Amazon purchase-behavior dataset.
Experimental conclusion: for behavior prediction and recommendation tasks, self-attention + time encoding is a good replacement for CNN + pooling or LSTM encoders. Training is about 4 times faster than with CNN/LSTM, and the accuracy is also slightly better than the other methods.
Case Study
To examine the significance of multi-space self-attention, we conducted a simple case study on the Amazon dataset, as shown below:
From the figure we can see that different spaces attend to very different things. For example, the attention score trends in each row of spaces I, II, III, and VIII are similar, which may mainly reflect the overall influence of the different behaviors. In other spaces, such as VII, high attention scores tend to form dense blocks, and these correspond to products belonging to the same category.
The figure below shows the scores of vanilla attention in different semantic spaces.
Multi-task learning
In the paper, we collected offline data of Alibaba e-commerce users covering product behaviors (purchase, click, favorite, add-to-cart), coupon collection, and keyword search, and trained on these three types of behaviors. Likewise, we predict these three different behaviors at the same time. The user's product behavior records come from the whole network, but the product click behavior to be predicted is the real exposure and click log of one in-store recommendation scenario; the training and prediction of coupon and keyword behaviors are both network-wide.
We constructed 7 training settings for comparison: single-behavior samples predicting the same kind of behavior (3 settings), all behaviors with a separate model per predicted behavior (3 settings), and all behaviors with a single model predicting all behaviors (1 setting). In the last setting, we cut the three prediction tasks into mini-batches, shuffle them, and train them jointly.
The experimental results are as follows:
all2one means three separate models, each predicting one task; all2all means a single model predicting all three tasks, i.e., the three tasks share all parameters with no task-specific part. It is therefore understandable that all2all scores slightly lower than all2one. When training the multi-task all2all model, the three prediction tasks are cut into mini-batches and fully shuffled, as sketched below. The multi-task training method in this article still has much room for improvement, and some promising approaches have appeared recently; this is one of the directions we are currently exploring.
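A minimal sketch of this batching scheme, assuming each task's data has already been cut into mini-batches; the task names and the commented train_step call are hypothetical.

```python
# Hypothetical sketch: pool mini-batches from the three prediction tasks and
# fully shuffle them before training the shared (all2all) model.
import random

def shuffled_multitask_batches(task_batches):
    """task_batches: dict mapping task name -> list of mini-batches."""
    mixed = [(task, batch) for task, batches in task_batches.items() for batch in batches]
    random.shuffle(mixed)  # fully random shuffle across the three tasks
    return mixed

# Toy example; in practice each batch holds (user behaviors, target, label).
batches = {
    "item_click": [f"item_batch_{i}" for i in range(4)],
    "coupon_receive": [f"coupon_batch_{i}" for i in range(2)],
    "keyword_search": [f"keyword_batch_{i}" for i in range(2)],
}
for task, batch in shuffled_multitask_batches(batches):
    pass  # train_step(shared_model, task, batch)  # all parameters shared (all2all)
```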
Experiments show that our framework can achieve better recommendation and behavior prediction effects by incorporating more behavioral data.
▌Summary
In this paper, a general user representation framework is proposed to fuse different types of user behavior sequences, which is verified in the recommendation task.
In the future, we hope to combine more practical business scenarios and richer data to develop a flexible and scalable user representation system, so as to better understand users, provide better personalized services, and output more comprehensive data capabilities.