How to apply GCN to a 1-billion-node heterogeneous network
1. Introduction
In the graph, interactions between users and products are usually the most direct and effective edges: they explicitly describe user preferences and have yielded measurable improvements in recommendation quality. The biggest problem with this scheme is that explicit interaction data is highly sparse. In real scenarios, a large amount of heterogeneous information can be introduced to enrich the network representation, such as a user's search terms, visited shops, preferred brands, and preferred attributes. These features provide richer semantic representations and relevance descriptions. IntentGC, the unified GCN-based network embedding framework proposed in this paper, integrates explicit user-product preference relations with this rich heterogeneous relational information to improve recommendation quality. The core technique is graph convolution; we make several optimizations on top of the classic graph convolution to better address the core challenges in our business: strong heterogeneity and massive scale.
2. Problem Definition
3. Model Design
The model designed in this paper is a large-scale graph convolution algorithm that fuses multiple sources of information. It models the data as a bipartite heterogeneous graph and uses a triplet loss, which focuses learning on the user's explicit preferences. The whole process is a semi-supervised model that effectively exploits the large amount of unlabeled information in the e-commerce system to improve the accuracy of the learning objective. The solution has three core parts: network translation, which losslessly translates the original network; the fast convolutional network IntentNet, which performs efficient convolution over heterogeneous information; and dual convolution, which learns user and product representations on the translated heterogeneous information network (HIN).
Network translation
Introducing a variety of heterogeneous node types enriches the network but also raises the challenge of semantic incompatibility: distinguishing node types during computation imposes a huge complexity burden on large-scale networks with many heterogeneous node and edge types. Following related work on second-order proximity, this paper translates the original network into user-user and product-product relations. Similarity is computed from shared auxiliary information; the core idea is that if u1 and u2 are connected to the same auxiliary node, then u1 and u2 are related. In this way the semantic information of heterogeneous nodes is encoded into user-user or product-product relations, achieving a translation of the original network's information.
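The translation step can be sketched as follows. This is a minimal illustration (not the paper's implementation): two users get a user-user edge whenever they connect to the same auxiliary node, and the edge weight counts how many auxiliary nodes they share; `min_shared` is a hypothetical threshold for pruning weak relations.

```python
from collections import defaultdict
from itertools import combinations

def translate_network(user_aux_edges, min_shared=2):
    """Translate user-auxiliary edges into weighted user-user edges.

    Second-order proximity: if u1 and u2 both link to the same
    auxiliary node (search term, shop, brand, ...), they are related.
    """
    aux_to_users = defaultdict(set)
    for user, aux in user_aux_edges:
        aux_to_users[aux].add(user)

    weights = defaultdict(int)
    for users in aux_to_users.values():
        # each shared auxiliary node adds 1 to every user pair it connects
        for u1, u2 in combinations(sorted(users), 2):
            weights[(u1, u2)] += 1

    # keep only sufficiently strong user-user relations
    return {pair: w for pair, w in weights.items() if w >= min_shared}

edges = [("u1", "brandA"), ("u2", "brandA"),
         ("u1", "shopX"), ("u2", "shopX"), ("u3", "shopX")]
print(translate_network(edges, min_shared=2))
# {('u1', 'u2'): 2}
```

The same construction applied to item-auxiliary edges yields the product-product relations.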
Fast Convolutional Network IntentNet
The original GCN is computationally prohibitive on large graphs, because information propagates through high-order neighborhoods and the complexity grows exponentially with depth. The fast convolutional network IntentNet proposed in this paper solves this through two optimizations. First, in the convolution operator, not all neurons are equally important: during activation only the most relevant neurons contribute most, so we design the graph convolution as a sparse network activation, which can also be viewed as channel-shared vector learning, and propagate neighbor information through vectorized convolution. Second, we observe that the exponential convolution cost mainly comes from high-order nodes, and that training can be decoupled into two modules: a graph view and a node view. Based on these two observations, we redesign the graph convolution to realize feature combination through a fully connected network. Experiments show better efficiency and quality than GraphSAGE.
a) Vectorized convolution function
Representation learning performs two tasks: learning the relation between a node and its neighbors, to measure each neighbor's influence, and learning the spatial relations among vector dimensions, to automatically extract useful feature combinations. Graph convolution consists of two steps: an aggregation step, which pools the neighbors' embeddings, and a convolution step, which combines the pooled neighborhood with the node's own embedding. The classic form, which this article calls bit-wise convolution, multiplies the full vectors by a dense weight matrix. In fact, we found it unnecessary to compute interactions between all pairs of features: we design the graph convolution as a sparse network activation, which can also be viewed as channel-shared vector learning, and realize neighbor information propagation through a vectorized convolution function, in which each output dimension combines only the corresponding dimensions of the self and neighborhood vectors, weighted by learned per-channel parameters.
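The two steps can be sketched like this (an illustrative reconstruction, not the paper's exact formula): aggregation averages the neighbor embeddings, and the vector-wise convolution combines self and neighborhood with elementwise weights `theta1`, `theta2` instead of a full dense matrix, dropping the per-node cost of a layer from O(d^2) to O(d).

```python
import numpy as np

def vectorized_conv(h_self, h_neighbors, theta1, theta2):
    """One vector-wise convolution step (a sketch).

    Aggregation: average the neighbor embeddings.
    Convolution: channel-shared elementwise weights combine the self
    vector and the aggregated neighborhood vector.
    """
    h_agg = h_neighbors.mean(axis=0)            # aggregation step
    out = theta1 * h_self + theta2 * h_agg      # vector-wise combine
    return np.maximum(out, 0.0)                 # ReLU activation

h_self = np.array([1.0, -1.0])
neighbors = np.array([[2.0, 0.0], [0.0, 2.0]])
out = vectorized_conv(h_self, neighbors,
                      theta1=np.array([0.5, 0.5]),
                      theta2=np.array([1.0, 1.0]))
print(out)  # [1.5 0.5]
```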
b) IntentNet
The convolution is split into two training modules, a graph view and a node view, whose combination realizes the function of a full graph convolution. The former stacks multiple layers of the vectorized convolution function above, effectively learning neighbor propagation and carrying out the graph-convolution part of the task; the latter feeds the result through fully connected layers to learn the feature relations across vector dimensions.
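The two-module split might look like the following sketch, under my assumptions (the class name, random initialization, and the simplification that the aggregated neighborhood vector is fixed across layers are all illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

class IntentNetSketch:
    """Illustrative graph-view / node-view decomposition."""

    def __init__(self, dim, n_layers=2):
        # per-layer channel-shared weights for the graph view
        self.thetas = [(rng.normal(size=dim), rng.normal(size=dim))
                       for _ in range(n_layers)]
        self.W = rng.normal(size=(dim, dim))   # node-view dense layer

    def graph_view(self, h_self, h_neigh_agg):
        # stacked vector-wise convolutions propagate neighbor info
        # (a faithful version would re-aggregate neighbors per layer)
        for t1, t2 in self.thetas:
            h_self = np.maximum(t1 * h_self + t2 * h_neigh_agg, 0.0)
        return h_self

    def forward(self, h_self, h_neigh_agg):
        h = self.graph_view(h_self, h_neigh_agg)
        # node view: fully connected layer learns cross-dimension
        # feature combinations once, after propagation
        return np.maximum(self.W @ h, 0.0)
```

The design point is that the expensive part (neighbor propagation) uses only cheap elementwise weights, while the dense feature-mixing happens once per node rather than once per convolution layer.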
Dual convolution
To accurately capture the representations and label information of users and items, we depart from the traditional GCN and design a dual GCN structure trained within a single framework. Concretely, users go through an independent convolution, while positive items and negative samples share a convolution; at the end of the convolution layers the three are projected into the same semantic space through dense networks, and learning proceeds with a triplet loss. The advantage of this structure is more accurate heterogeneous representation than the classic GCN; moreover, we show that the two dual convolutions converge together and yield a good semi-supervised effect.
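The triplet objective over the three projected embeddings can be sketched as follows (a standard margin-based triplet loss; the exact distance and margin used in the paper are not specified here, so squared Euclidean distance and `margin=1.0` are assumptions):

```python
import numpy as np

def triplet_loss(u, pos, neg, margin=1.0):
    """Triplet objective over (user, positive item, negative item)
    embeddings projected into one semantic space: push the user
    closer to the interacted item than to the sampled negative."""
    d_pos = np.sum((u - pos) ** 2)
    d_neg = np.sum((u - neg) ** 2)
    return max(0.0, margin + d_pos - d_neg)

u = np.zeros(2)
print(triplet_loss(u, pos=np.zeros(2), neg=np.array([2.0, 0.0])))  # 0.0
print(triplet_loss(u, pos=np.zeros(2), neg=np.array([0.5, 0.0])))  # 0.75
```

When the negative item is already far enough away, the loss is zero and gradients vanish; otherwise the loss pulls the positive pair together and pushes the negative apart.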
IntentGC algorithm framework
The IntentGC framework consists of three parts: 1) network translation; 2) training; 3) inference. After training we obtain vector representations of users and products, and retrieval and recommendation are carried out with k-nearest-neighbor search.
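The inference step reduces to nearest-neighbor retrieval over the learned vectors, as in this brute-force sketch (cosine similarity is an assumption; at billion scale a production system would use an approximate nearest-neighbor index instead):

```python
import numpy as np

def recommend(user_vec, item_vecs, k=2):
    """Return indices of the top-k items by cosine similarity
    between the learned user vector and all item vectors."""
    u = user_vec / np.linalg.norm(user_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ u                 # cosine similarity per item
    return np.argsort(-scores)[:k]    # highest-scoring items first

user = np.array([1.0, 0.0])
catalog = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
print(recommend(user, catalog, k=2))  # [0 2]
```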
4. Experimental conclusion
The experiments verify three things: the quality of IntentGC against existing algorithms, the efficiency of IntentNet against GraphSAGE on billion-scale graph learning tasks, and the gain in model capacity from adding heterogeneous information. We perform offline evaluation on Taobao and Amazon data against DeepWalk, GraphSAGE, DSPR, Metapath2vec++, BiNE, and other algorithms. Both the offline results on the Taobao and Amazon datasets and online experiments in the Taobao environment demonstrate the effectiveness of our algorithm.
5. Summary and Outlook
This paper proposes a new large-scale graph convolution learning scheme that fuses multiple sources of information. Experiments show that the large amount of unlabeled information in an e-commerce system is highly valuable for product recommendation, and the fast graph convolution framework we designed scales to networks with one billion nodes. Having proven its effectiveness in product recommendation, we hope to apply the framework to more tasks in the future. In addition, given the importance of real-time online user features, a dynamic graph convolution model is worth studying to improve the model's real-time behavior.