Ali Knowledge Engine

Alimei's note: In April 2018, the Knowledge Graph Team of Alibaba's Business Platform Division teamed up with five institutions: Tsinghua University, Zhejiang University, the Institute of Automation of the Chinese Academy of Sciences, the Institute of Software of the Chinese Academy of Sciences, and Soochow University, to jointly release the Cangjingge (Knowledge Engine) research plan.

The Cangjingge project relies on Alibaba's powerful computing infrastructure (such as the iGraph graph database) and advanced machine learning platforms (such as PAI). What technological breakthroughs has the Alibaba Knowledge Graph team made since the plan was released? Let's find out today.


In the year since the Cangjingge plan was released, we have redefined knowledge engine technology as five technical modules: knowledge acquisition, knowledge modeling, knowledge reasoning, knowledge fusion, and knowledge service, and have developed and deployed each of them.

Among these, knowledge modeling defines how the concepts, events, rules, and their interrelationships in general or specific domain knowledge are represented, and establishes the conceptual model of the domain's knowledge graph. Knowledge acquisition instantiates the elements defined by knowledge modeling, structuring unstructured data into knowledge in the graph. Knowledge fusion semantically integrates heterogeneous, fragmented knowledge: by discovering the relationships among fragmented and heterogeneous pieces of knowledge, it obtains a more complete description of each piece and of the relationships between them, so that the knowledge complements and integrates with itself. Knowledge reasoning provides computation and reasoning models over the knowledge graph, discovering related and implicit knowledge within it. Knowledge service exposes the constructed knowledge graph as knowledge-based intelligent services, improving the intelligence of application systems.

Figure 1 Cangjingge (Knowledge Engine) product

After a year of work: in the knowledge modeling module, we developed algorithms such as automatic ontology construction and automatic attribute discovery, and built a tool for constructing knowledge graph ontologies. In the knowledge acquisition module, we developed algorithms for new-entity recognition, compact event recognition, relation extraction, and more, reaching the highest level in the industry. In the knowledge fusion module, we designed deep learning algorithms for entity alignment and attribute alignment that scale well across different knowledge bases, greatly enriching the knowledge in the graph. In the knowledge reasoning module, we proposed CharTransE, a knowledge graph representation learning model based on character embeddings, and XTransE, an interpretable knowledge graph representation learning model, and developed a powerful reasoning engine.

Based on the above technical modules, we developed a general knowledge engine product, which has been successfully applied in dozens of products across the Alibaba economy, including Taobao, Tmall, Hema Xiansheng, Fliggy, and Tmall Genie. It serves more than 80 million online calls per day and outputs an average of 900 million pieces of knowledge offline per day. On top of the knowledge engine product, we have built and now operate graph services in five vertical domains, including commodities, tourism, and new manufacturing.

Figure 2 Diagram of four levels of knowledge engine

While building each module, we overcame a series of technical problems one after another. This article introduces two of them:

1. A Named Entity Recognition Approach for Adversarial Learning on Crowdsourced Data

The knowledge acquisition module covers basic tasks such as entity recognition, entity linking, new-entity discovery, relation extraction, and event mining, among which named entity recognition (NER) is the core task.

The best current named entity recognition algorithms in academia are mainly based on supervised learning, so the key to building a high-performance NER system is obtaining a high-quality annotated corpus. High-quality labeled data usually requires expert annotators, which is expensive and slow, so the more common industry solution is crowdsourced annotation. Crowdsourced labels, however, are noisier and less consistent than expert labels, which degrades the models trained on them. To address this problem, we propose an adversarial network for crowdsourced annotation data that learns what the crowdsourced annotators have in common, removes the noise, and improves the performance of Chinese NER.

The specific network framework of this work is shown in Figure 3:

Figure 3 Entity recognition model based on adversarial network

Annotator ID: we use a lookup table that stores a vector representation for each worker ID, initialized with random values. During training, the ID vectors are model parameters and are optimized together with all other parameters in each iteration; for each annotated training example, we look up the ID vector of its annotator directly. At test time, where no annotator information is available, we use the average of all ID vectors as the input instead.
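The lookup-and-average scheme can be sketched as follows (the worker IDs and embedding dimension are illustrative; in the real model the table entries are trainable parameters updated by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8

# Hypothetical lookup table: each crowdsource worker ID maps to a vector,
# initialized randomly and (in the real model) trained with the other parameters.
id_table = {w: rng.standard_normal(EMB_DIM) for w in ["w1", "w2", "w3"]}

def id_vector(worker_id=None):
    """Return the annotator ID vector for one example.

    During training the worker ID is known, so we look it up directly.
    At test time no annotator is attached to the input, so we feed the
    average of all ID vectors instead, as described above.
    """
    if worker_id is not None:
        return id_table[worker_id]
    return np.mean(list(id_table.values()), axis=0)

train_vec = id_vector("w2")  # direct lookup during training
test_vec = id_vector()       # mean of all ID vectors at test time
```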

Adversarial learning (WorkerAdversarial): crowdsourced data used as a training corpus contains a certain amount of labeling error, i.e., "noise". These errors are introduced by the annotators: different annotators understand the annotation guidelines differently and bring different background knowledge. The LSTM modules used for adversarial learning are as follows:

The LSTM for private information, called "private", learns to fit each annotator's individual distribution. The LSTM for shared information, called "common", takes the sentence as input and learns the features that the annotation results have in common.


The LSTM for label information, called "label", takes the label sequence of the training example as input.

The label and common LSTM features are then combined and fed into a CNN layer for feature combination and extraction, and finally into the annotator classifier. Note that we want the annotator classifier to ultimately lose its discriminative power: if the learned features cannot distinguish between annotators, they are exactly the common features we want. Therefore, when the parameters are optimized, the gradient is reversed before the update.

For the actual entity recognition task, we concatenate the common and private LSTM features with the annotator ID vector as input to the entity tagging component, and finally decode with a CRF layer to complete the tagging.
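The reversed update is the classic gradient reversal trick: identity in the forward pass, negated (and optionally scaled) gradient in the backward pass, so that the feature extractor is pushed to make the annotator classifier fail. A minimal sketch (the class name and scaling factor `lam` are illustrative; deep learning frameworks implement this as a custom autograd function rather than a standalone class):

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer.

    Forward pass: identity, features flow through unchanged.
    Backward pass: multiply the incoming gradient by -lam, so the shared
    ("common") features are trained to be annotator-indistinguishable
    while the classifier on top still tries to identify the annotator.
    """

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # identity in the forward direction

    def backward(self, grad_out):
        return -self.lam * grad_out  # flip the gradient's sign

grl = GradReverse(lam=0.5)
features = np.array([1.0, -2.0, 3.0])
out = grl.forward(features)                      # features unchanged
grad = grl.backward(np.array([0.2, 0.2, 0.2]))   # reversed, scaled gradient
```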

The experimental results are shown in Figure 4. Our algorithm achieves the best performance on both datasets, product titles and user search queries:

Fig. 4 Experimental results of entity recognition model based on adversarial network

2. A Knowledge Graph Reasoning Algorithm Based on Iterative Learning of Rules and Embeddings

Knowledge graph reasoning is an indispensable technique for completing and verifying the relationships and attributes in a graph. Rules and embeddings are two different approaches to knowledge graph reasoning, each with its own strengths and weaknesses. Rules are precise and interpretable, but most rule learning methods face efficiency problems on large-scale knowledge graphs. Embedding representations have strong feature-capturing ability and scale to large, complex knowledge graphs, but a good embedding depends on rich training information, so it is difficult to learn good embeddings for sparse entities. We propose iterative learning of rules and embeddings: in this work, representation learning is used to learn rules, the rules are used to predict latent triples for sparse entities, and the predicted triples are fed back into the learning of the embedding representation; this loop is then iterated continuously. The overall framework is shown in Figure 5:

Figure 5 Overall framework of iterative learning of rules and embeddings

The objective function for embedding learning optimization is defined over a triple scoring function: l_sro denotes the label of a triple (s, r, o); v_s denotes the embedding of the subject in a graph triple; M_r denotes the mapping of the relation between the two entities; and v_o denotes the embedding of the object.
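A plausible form of this objective, consistent with the symbols above (the exact formulation in the original work may differ), is a bilinear score with a logistic loss over labeled triples:

```latex
% Scoring function for a triple (s, r, o): subject embedding v_s,
% relation mapping M_r, object embedding v_o
\phi(s, r, o) = \mathbf{v}_s^{\top} \mathbf{M}_r \, \mathbf{v}_o

% Logistic objective over labeled triples, with label l_{sro} \in \{0, 1\}
\mathcal{L} = -\sum_{(s, r, o)}
    \Bigl[\, l_{sro} \log \sigma\bigl(\phi(s, r, o)\bigr)
    + (1 - l_{sro}) \log \bigl(1 - \sigma(\phi(s, r, o))\bigr) \Bigr]
```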

Based on the learned rules (axioms), inference can then be executed. The iterative strategy is: first use the embedding method to learn rules from the graph, then execute rule inference and add the newly inferred relations back into the graph. Through this continuous learn-and-iterate process, relation prediction in the graph becomes more and more accurate. In the end, our algorithm achieved very good performance:
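The learn-infer-add loop can be sketched with a toy graph and a single hand-written composition rule standing in for the learned axioms (all entities, relations, and the rule itself are illustrative; the real system learns the rules from the embeddings):

```python
# Toy sketch of the iterative strategy: infer new triples with a rule,
# add them to the graph, and repeat until nothing new is produced. In the
# real system each round would also retrain the embeddings on the
# enriched graph, which in turn refines the learned rules.

triples = {
    ("shop_a", "located_in", "hangzhou"),
    ("hangzhou", "part_of", "zhejiang"),
    ("shop_b", "located_in", "hangzhou"),
}

def apply_composition_rule(kg):
    """Rule: (x, located_in, y) AND (y, part_of, z) => (x, located_in, z)."""
    inferred = set()
    for (x, r1, y) in kg:
        if r1 != "located_in":
            continue
        for (y2, r2, z) in kg:
            if r2 == "part_of" and y2 == y:
                inferred.add((x, "located_in", z))
    return inferred - kg  # only triples not already in the graph

for _ in range(3):  # iterate until the rule yields no new triples
    new = apply_composition_rule(triples)
    if not new:
        break
    triples |= new  # inferred triples enrich the next learning round
```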

Beyond the two works above, our research and development of knowledge engine technology includes a series of other cutting-edge works that have achieved industry-leading results, with papers published at conferences such as AAAI, WWW, EMNLP, and WSDM.

Going forward, the Alibaba Knowledge Graph team will continue to advance the Cangjingge plan, building general and transferable knowledge graph algorithms and delivering the knowledge in the graph to applications inside and outside Alibaba, giving these applications the wings of AI and becoming infrastructure for the Alibaba economy and even for society as a whole.
