AliGraph: An Industrial Graph Neural Network Platform

Why focus on GNNs?

In the era of big data, using high-speed computers to discover patterns in data is arguably the most effective means we have. To make machine computation purposeful, human knowledge must be supplied as input. We have passed through three stages: expert systems, classical machine learning, and deep learning. The knowledge we feed in has moved from the concrete to the abstract, from rules to features to models, becoming ever more macroscopic. The level of abstraction has risen and the coverage has widened, but our insight into the underlying mechanics has weakened and model interpretability has deteriorated. Deep learning has delivered considerable value, yet the explainability work behind it progresses slowly. Because of this, when we apply deep learning to sensitive problems involving personal safety, property, or the law, numerical gains alone are not enough to justify the technology; we need to know the reasons behind the results.

A graph is a carrier of knowledge, and the physical connections within it imply strong causal relationships. Importantly, it is an intuitive, human-readable structure. Using graphs as the knowledge substrate while leveraging the generalization power of deep learning looks like a viable direction, and on some problems it brings us a step closer to the goal of interpretability. In the paper distributions of top deep-learning conferences over recent years, graph neural networks (GNNs) have been flourishing. GNNs offer a broadly applicable way to solve problems; many search and recommendation algorithms can be cast in the GNN paradigm. Whether viewed as a future technical reserve or as an expansion of current applications, GNNs are a direction well worth investing in.

Compared with mature technologies such as CNNs and RNNs, GNNs are still in an exploratory stage. Graphs are not to GNNs what images are to CNNs or natural language is to RNNs. Even with graph data in hand, there is no fixed recipe for applying a GNN, and no well-established convolution-like operator that can simply be called. The effectiveness of GNNs must be verified in more scenarios, and each scenario demands developers with a deep understanding of it who can both process graph data and write deep learning models on top of it. Only once application scenarios proliferate will it be possible to abstract common GNN operators and algorithms and hand these relatively mature capabilities to users; that is when GNNs will truly become popularized. Based on these considerations, rather than shipping finished algorithms for users to consume, the platform at this stage focuses on providing APIs to developers, empowering them to implement GNNs tailored to their own scenarios.

On the other hand, graph data in industrial scenarios is complex and enormous. The platform cannot exist apart from real scenarios; it must be business-driven, because that is how products with real value are most likely to be incubated. Take Alibaba's e-commerce recommendation scenario as an example: the graph data generated every day reaches hundreds of terabytes and is highly heterogeneous (multiple types of vertices, multiple types of edges). Vertices and edges carry rich attributes, such as a product's name, category, and price range, or even its associated images and videos; these attributes exist as plain text and media rather than as vectorized, structured information. With such data as input, training GNNs efficiently is a very challenging problem. If data preprocessing, pre-training, and similar means are used to structure and vectorize the graph data, large amounts of compute, storage, and labor are consumed. A platform that is truly friendly to GNN developers should be end-to-end: within one IDE, users can manipulate complex graph data, connect that data to deep neural networks, and freely write upper-layer models. The platform provides a simple, flexible interface that delivers the scalability and ecosystem compatibility required by GNNs' rapid development, along with the scale and stability demanded by complex distributed environments.

Technology stack
Hierarchical architecture

AliGraph covers the entire pipeline from raw graph data to GNN applications, reducing the exploration cost of GNN algorithms to the same level as traditional deep learning algorithms. The platform can be viewed in three layers: the data layer, the engine layer, and the application layer.
The data layer supports large-scale homogeneous graphs, heterogeneous graphs, and attribute graphs. Data does not need to be pre-built into a graph; the platform provides APIs that simplify data analysis and graph construction. The data layer interface is easy to extend, making it convenient to connect graph data in different formats and media.

The engine layer includes the Graph Engine and the Tensor Engine. The Graph Engine is divided into a logical object layer and an operator layer. The logical object layer describes the form the raw data takes once loaded into the system. Each object entity provides relevant semantic interfaces; for a Graph object, for example, one can query the graph's topology, its degree of heterogeneity, and the number of vertices and edges. In practice, a user only needs to declare a logical object and specify its data source.
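As a self-contained sketch of this logical-object idea (all class and method names here are illustrative assumptions, not AliGraph's actual API), a declared graph binds lazily to its data sources and then answers semantic queries:

```python
class LogicalGraph:
    """Toy logical Graph object: records declarations, binds data lazily."""

    def __init__(self):
        self._node_sources = {}   # node_type -> source name
        self._edge_sources = {}   # edge_type -> source name
        self._nodes = {}
        self._edges = {}

    def node(self, source, node_type):
        self._node_sources[node_type] = source
        return self               # chainable declaration

    def edge(self, source, edge_type):
        self._edge_sources[edge_type] = source
        return self

    def init(self, loader):
        # `loader` maps a source name to records; a real system would
        # read from a distributed file system here.
        for ntype, src in self._node_sources.items():
            self._nodes[ntype] = list(loader(src))
        for etype, src in self._edge_sources.items():
            self._edges[etype] = list(loader(src))
        return self

    # Semantic interfaces exposed by the logical object.
    def num_vertices(self, node_type):
        return len(self._nodes[node_type])

    def num_edges(self, edge_type):
        return len(self._edges[edge_type])

    def is_heterogeneous(self):
        return len(self._node_sources) > 1 or len(self._edge_sources) > 1


# Declare a small user-click-item graph against in-memory "tables".
tables = {"user_table": [1, 2, 3],
          "item_table": [10, 11],
          "click_table": [(1, 10), (2, 11), (3, 10)]}
g = (LogicalGraph()
     .node("user_table", node_type="user")
     .node("item_table", node_type="item")
     .edge("click_table", edge_type="click")
     .init(loader=tables.__getitem__))
```

The key design point mirrored here is that declaration is separated from materialization: the user describes the graph's shape, and the system decides where and how the data is actually loaded.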

The operator layer defines the computing operations that can be performed on top of logical objects. For example, the Graph object supports a variety of Sampler operators that feed input to upper-layer GNN algorithms. The operator layer is highly extensible to meet the diverse operator needs of different scenarios. The currently built-in operators revolve around GNN algorithms and their ecosystem, including graph query, graph sampling, negative sampling, KNN, and so on.
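To illustrate what these sampling operators do, here is a toy in-memory version of a fan-out neighbor sampler and a simple negative sampler (the real operators run distributed; function names and signatures here are assumptions for illustration):

```python
import random

def sample_neighbors(adj, seeds, fanout, seed=0):
    """For each seed vertex, sample `fanout` neighbors with replacement.

    adj: {vertex: [neighbor, ...]}. Vertices without neighbors get [].
    """
    rng = random.Random(seed)
    return {v: [rng.choice(adj[v]) for _ in range(fanout)] if adj.get(v) else []
            for v in seeds}

def negative_sample(adj, seed_vertex, num, vertices, seed=0):
    """Sample `num` vertices that are NOT neighbors of `seed_vertex`."""
    rng = random.Random(seed)
    forbidden = set(adj.get(seed_vertex, [])) | {seed_vertex}
    negatives = []
    while len(negatives) < num:
        cand = rng.choice(vertices)
        if cand not in forbidden:
            negatives.append(cand)
    return negatives
```

Fixed-fanout sampling with replacement keeps every vertex's sampled neighborhood the same size, which is what lets the results be batched into dense tensors downstream.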

The Tensor Engine refers to a deep learning engine such as TensorFlow, PyTorch, or any other library with a Python interface. The output of the Graph Engine consists of format-aligned NumPy objects, which connect seamlessly to the deep learning engine. GNN developers can freely write NN logic on top of the graph and combine it with business requirements to form a deep network model for end-to-end training.
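To make that hand-off concrete, here is a hypothetical sketch of packing a sampled 1-hop neighborhood into format-aligned NumPy arrays that any tensor engine (TensorFlow, PyTorch, ...) could consume; the shapes and the zero-padding convention are assumptions for illustration:

```python
import numpy as np

def to_dense_batch(neighbors, feature_lookup, fanout, feat_dim):
    """neighbors: {seed_id: [nbr_id, ...]} -> (seed_ids, nbr_feats).

    nbr_feats has shape [num_seeds, fanout, feat_dim]; short neighbor
    lists are zero-padded and long ones truncated so rows stay aligned.
    """
    seeds = np.array(sorted(neighbors), dtype=np.int64)
    feats = np.zeros((len(seeds), fanout, feat_dim), dtype=np.float32)
    for i, s in enumerate(seeds):
        for j, n in enumerate(neighbors[s][:fanout]):
            feats[i, j] = feature_lookup(n)
    return seeds, feats
```

Because every seed's row has identical shape, the arrays can be fed directly as a tensor batch without any per-example reshaping in the model code.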

The application layer emphasizes end-to-end integration with the business, rather than computing graph embeddings in isolation and consuming the results elsewhere. Mature algorithms polished in real scenarios are also distilled into the application layer and offered to users as algorithm components.

Generalizing from the GCN framework, a typical GNN program can be summarized as iterating three steps: sample a neighborhood for each vertex, vectorize and aggregate the neighbors' representations, and update each vertex's representation with a learnable transform. The system is designed to support this paradigm efficiently.
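A minimal dense sketch of one such step follows (a mean aggregator in the GraphSAGE/GCN style; the weight shapes and the ReLU choice are illustrative assumptions, not a prescribed AliGraph layer):

```python
import numpy as np

def gnn_layer(h, neighbors, w_self, w_nbr):
    """One sample-aggregate-update step.

    h: [num_nodes, d_in] vertex representations.
    neighbors: {v: [u, ...]} sampled neighborhood per vertex.
    Returns updated representations of shape [num_nodes, d_out].
    """
    out = np.zeros((h.shape[0], w_self.shape[1]), dtype=h.dtype)
    for v in range(h.shape[0]):
        nbrs = neighbors.get(v, [])
        # Aggregate: mean of sampled neighbor representations.
        agg = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1], h.dtype)
        # Update: learnable transform of self + aggregate, then ReLU.
        out[v] = np.maximum(h[v] @ w_self + agg @ w_nbr, 0.0)
    return out
```

Stacking k such layers lets each vertex see its k-hop neighborhood, which is exactly why the sampling operators of the previous section are organized around multi-hop queries.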

Among these steps, vectorization and aggregation can exploit the expressive power of the deep learning engine. Realizing the above computing pattern therefore comes down to the graph-related operations and how they connect to the deep learning engine. We refine the technology stack as shown in the figure below, where Storage, Sampler, and Operator are the main problems the system must solve. Information propagates forward between layers from bottom to top, and gradients update each layer's parameters from top to bottom; the entire GNN application is described as one deep network. The Graph object in the Storage layer is logical storage, beneath which an abstract file interface adapts to a variety of data sources; this is the prerequisite for the system's portability. Sampler provides a rich set of operators that can be extended independently of the system framework to meet diverse needs. Operator encapsulates graph-semantic operations and hides performance optimization and data plumbing beneath a concise interface.

Efficient graph engine
More specifically, the graph engine is the bridge between graph data and the deep learning framework, ensuring efficient and stable data transfer. The graph operations here are oriented toward GNNs and differ considerably from general-purpose graph computation. The Graph Engine is a distributed service with high performance and high availability: it can build a heterogeneous graph with tens of billions of edges within 2 minutes, perform multi-hop cross-machine sampling in around ten milliseconds, and supports lossless failover from failures. Internally, the Graph Engine optimizes the RPC path to achieve zero-copy data transfer, and connections between servers are thread-level; while maximizing bandwidth utilization, each thread processes requests independently without locks. This is the main reason for the system's excellent performance. In addition, we accelerate sampling and negative sampling through effective caching, decentralization, and other means, yielding significant performance improvements.

Extensible operators
To support the rapid development of GNNs, the system allows operators to be extended freely. The system framework comprises three parts: the user interface, the distributed runtime, and distributed storage. An operator is invoked through the user interface; it then reads data and completes a distributed computation. We refine the distributed runtime and storage interfaces and keep the programming surface within a safe range, so users can develop custom operators against these interfaces. Custom operators are registered uniformly on the user interface without adding new user APIs. Specifically, each Operator is a distributed operator whose required data is distributed across the servers of the service. We abstract Map() and Reduce() semantics: Map() splits a computation request and forwards each piece to the server that owns its data, ensuring data and computation are colocated and avoiding the cost of data movement, while Reduce() merges the results from each server. An operator also implements Process() for the local computation, without needing to care about data serialization or distributed communication.
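This contract can be sketched in-process as follows. The class and method names mirror the description above but are illustrative, with the "servers" simulated as local adjacency dicts:

```python
class DegreeCountOperator:
    """Toy distributed operator: count the out-degree of requested vertices."""

    def Map(self, request_ids, num_servers):
        # Split the request by data placement (here: vid % num_servers),
        # so computation is colocated with the shard that owns each vertex.
        shards = [[] for _ in range(num_servers)]
        for vid in request_ids:
            shards[vid % num_servers].append(vid)
        return shards

    def Process(self, shard_ids, local_adj):
        # Purely local computation on one server; serialization and
        # communication are the framework's job, not the operator's.
        return {vid: len(local_adj.get(vid, [])) for vid in shard_ids}

    def Reduce(self, partial_results):
        # Merge per-server partial results into the final response.
        merged = {}
        for part in partial_results:
            merged.update(part)
        return merged


def run(op, request_ids, server_data):
    """Simulate the framework driving Map -> Process -> Reduce."""
    shards = op.Map(request_ids, num_servers=len(server_data))
    partials = [op.Process(s, adj) for s, adj in zip(shards, server_data)]
    return op.Reduce(partials)
```

Only the three methods are operator-specific; the `run` driver stands in for the framework, which is why a custom operator never touches RPC or serialization code.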

Achievements

Data types: supports homogeneous graphs, heterogeneous graphs, attribute graphs, directed graphs, and undirected graphs, and can be easily connected to any distributed file system.

Data scale: supports ultra-large-scale graphs with hundreds of billions of edges and billions of vertices (TB-scale raw storage).

Operator types: supports dozens of graph query and sampling operators that can be combined with deep learning, supports vector retrieval, and supports on-demand operator customization.

Performance indicators: Support minute-level ultra-large-scale graph construction, millisecond-level multi-hop heterogeneous graph sampling, and millisecond-level large-scale vector retrieval.

User interface: a pure Python interface that forms an integrated IDE with TensorFlow; the development cost is no different from that of ordinary TF models.


AliGraph already supports the industry's mainstream graph embedding algorithms, including DeepWalk, Node2Vec, GraphSAGE, GATNE, and others. A number of self-developed algorithms are planned for release, and related papers have been published.

Within Alibaba Group, it already covers Taobao recommendation, Taobao search, new retail, network security (anti-terrorism, spam and anomaly detection, anti-cheating), online payment, Youku, Ali Health, and other related businesses. Typical results include:

"Guess You Like" recommendation on the Mobile Taobao homepage and cloud theme recommendation (55 million PV per day)

Compared with the graph embedding model implemented on other systems, the AliGraph implementation saves 300 TB of storage and 10,000 CPU-hours of compute per task, shortens training time by 2/3, and increases CTR by 12%.

Five security-related scenarios, including anti-terrorism, spam detection, and anomaly recognition

On a single day's heterogeneous graph with 3 billion edges and 100 million vertices, training time is shortened by 1/2 and model coverage accuracy improves by 6%-41%.

In addition, AliGraph has been released on the Alibaba Cloud public cloud platform, and we will continue to update it. We hope to see GNNs bring better solutions to more scenarios, and we hope more researchers will invest in this direction.


This article has given an overview of the AliGraph platform. In conveying the thinking behind it, we hope to make the GNN direction more convenient for researchers, and we welcome interested readers to join us in building the influence of GNNs and putting them into practical applications.
