E-commerce Knowledge Graph AliCoCo


In recent years, e-commerce search and recommendation algorithms have made great progress, but given the diverse needs of users, the current e-commerce experience is still not "smart". For many years, our search engine has been teaching users how to enter keywords to find the products they need faster, and this keyword-based search works well for users who already know which specific product they want. But users are often faced with a problem or a scenario instead: what tools are needed for "hosting an outdoor barbecue"? What products on Taobao can effectively "prevent the elderly at home from getting lost"? They need more "knowledge" to help them make decisions. In product recommendation, problems such as repeated recommendations, recommending a product the user has just bought, and a lack of novelty are often criticized. The current recommendation system relies mostly on users' historical behavior, recalling products through item-to-item (i2i) and similar methods, rather than starting from a model of user needs.

The reason behind these problems is that the underlying data that e-commerce technology relies on lacks a description of user needs. Specifically, the system Taobao currently uses to manage commodities is based on category-property-value (CPV), which lacks the breadth and depth of knowledge needed to describe and understand various user needs. As a result, search and recommendation algorithms built on it face a semantic gap when recognizing real user needs, which limits further improvement of the user experience.

In order to bridge this gap and allow e-commerce search and recommendation algorithms to better recognize user needs, we propose to build a new e-commerce knowledge graph that explicitly expresses user needs as nodes in the graph and, around these user-need nodes, constructs a large-scale semantic network linking user needs, knowledge, common sense, products and content: the Alibaba E-commerce Cognitive Concept Net, referred to as AliCoCo. We hope that AliCoCo can provide a unified data foundation for user understanding, knowledge understanding, and product and content understanding in the e-commerce field. After two years of hard work, we have completed the overall structural design and core data construction, and have deployed AliCoCo in multiple business scenarios such as e-commerce search and recommendation, achieving good results and improving user experience.

As shown in the figure below, AliCoCo is a concept network that mainly consists of four parts:

E-commerce Concepts
Atomic Concepts (Primitive Concepts)
Taxonomy
Product layer (Items)

In the E-commerce Concepts layer, which is AliCoCo's biggest innovation, we explicitly express user needs as nodes in the graph, each written as a short phrase in natural human language, such as "outdoor barbecue" or "keep warm for kids", and call each such node an "e-commerce concept". Although user needs are mentioned all the time, they have never been formally defined in the field of e-commerce. In many downstream applications (such as recommendation systems), category nodes (categories of goods) are often used to represent user needs. But user needs are far broader than that. In many cases, users are faced with a "scenario" or a "problem" and do not know which specific products can help solve it. Therefore, we generalize the definition of user needs into the e-commerce concept; see the following chapters for details. All e-commerce concepts used to represent user needs make up this layer.

At the atomic concept layer (Primitive Concepts), in order to better understand the e-commerce concepts mentioned above (that is, user needs), we break these phrases down to word granularity and use the resulting fine-grained words to describe user needs more systematically and accurately. These fine-grained words are called "atomic concepts". For example, the e-commerce concept "outdoor barbecue" can be expressed as "action: barbecue & location: outdoor & weather: sunny", where "barbecue", "outdoor" and "sunny" are all atomic concepts. All atomic concepts make up this layer.
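The decomposition of a phrase into typed atomic concepts can be pictured with a small data structure. This is only a sketch: the class names `AtomicConcept` and `EcommerceConcept` and the field names are hypothetical and chosen for illustration, not AliCoCo's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AtomicConcept:
    role: str   # the class the word plays in the phrase, e.g. "action"
    name: str   # the word itself, e.g. "barbecue"


@dataclass(frozen=True)
class EcommerceConcept:
    phrase: str                       # phrase-level user need
    atoms: Tuple[AtomicConcept, ...]  # word-level description of that need

    def describe(self) -> str:
        """Render the concept as 'role: name & role: name & ...'."""
        return " & ".join(f"{a.role}: {a.name}" for a in self.atoms)


# The example from the text: "outdoor barbecue".
outdoor_barbecue = EcommerceConcept(
    phrase="outdoor barbecue",
    atoms=(
        AtomicConcept("action", "barbecue"),
        AtomicConcept("location", "outdoor"),
        AtomicConcept("weather", "sunny"),
    ),
)

print(outdoor_barbecue.describe())
# action: barbecue & location: outdoor & weather: sunny
```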

In the Taxonomy, in order to better manage the atomic concepts mentioned above, we have constructed a classification system that describes the basic concepts of the world. It is not limited to the field of e-commerce, but for now it serves the understanding of e-commerce concepts. In this layer, we define first-level classes such as "time", "place", "action", "function", "category", "IP", etc., and each class is further subdivided into subclasses, forming a tree structure. Each class contains its instances, that is, atomic concepts. For example, the above-mentioned "barbecue", "outdoor" and "sunny" belong to "action-consumable action", "location-public space" and "time-weather" respectively. At the same time, relations are defined between classes; for example, an "applicable to (season)" relation is defined between "category-apparel-clothing-pants" and "time-season". Accordingly, atomic concepts under these two classes can be linked by triple instances of that relation.
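A minimal sketch of this idea follows: classes form paths in a tree, atomic concepts are instances attached to a class, and a relation declared between two classes constrains which instance triples are allowed. The container names and the instances "jeans" and "winter" are hypothetical, added only to show how such a triple could be checked.

```python
from typing import Dict, Tuple

# A class is identified by its path from the first-level class downward.
ClassPath = Tuple[str, ...]

# Atomic concept -> class path.
instances: Dict[str, ClassPath] = {
    # Examples given in the text.
    "barbecue": ("action", "consumable action"),
    "outdoor": ("location", "public space"),
    "sunny": ("time", "weather"),
    # Hypothetical instances, used only to illustrate the relation below.
    "jeans": ("category", "apparel", "clothing", "pants"),
    "winter": ("time", "season"),
}

# Relations are declared between classes (schema level): name -> (domain, range).
relations: Dict[str, Tuple[ClassPath, ClassPath]] = {
    "applicable to (season)": (
        ("category", "apparel", "clothing", "pants"),
        ("time", "season"),
    ),
}


def is_under(path: ClassPath, ancestor: ClassPath) -> bool:
    """True if `path` equals `ancestor` or lies below it in the tree."""
    return path[: len(ancestor)] == ancestor


def is_valid_triple(head: str, relation: str, tail: str) -> bool:
    """Check an instance triple against the class-level relation definition."""
    domain, range_ = relations[relation]
    return is_under(instances[head], domain) and is_under(instances[tail], range_)


print(is_valid_triple("jeans", "applicable to (season)", "winter"))     # True
print(is_valid_triple("barbecue", "applicable to (season)", "winter"))  # False
```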

If the taxonomy and the atomic concepts described above are combined, the result can be regarded as a relatively complete ontology. It is very similar to open-domain knowledge graphs such as Freebase and DBpedia; the only difference is that our instances include not only entities but also a large number of concepts. Compared with concept graphs such as Probase and ConceptNet, we have defined a complete type system.

At the product (content) layer, the billions of products and pieces of content on the Alibaba platform are associated with the e-commerce concept layer and the atomic concept layer. For example, products associated with "outdoor barbecue" may include barbecue grills, charcoal, ingredients, and so on. Note, however, that a product can be related to the e-commerce concept "outdoor barbecue" without being directly related to the corresponding atomic concept "outdoor". For a product, an e-commerce concept is like a scenario in which the product will be used, while an atomic concept is more like a fine-grained attribute that describes a characteristic of the product.

To sum up, in AliCoCo's system, user needs are expressed as phrase-level e-commerce concepts. Below them, a well-defined taxonomy and its atomic concept instances describe all e-commerce concepts. Finally, all products on the e-commerce platform are associated with e-commerce concepts or atomic concepts. Below, we describe each layer and the algorithmic problems encountered during construction.

Taxonomy
AliCoCo's classification system is a huge tree structure that contains millions of atomic concept instances. Because building the taxonomy demands a great deal of expert knowledge, and its design is crucial to the entire knowledge system, we manually defined about 20 first-level classes (pictured below). Those designed specifically for the e-commerce field are: "category", "pattern", "function", "material", "color", "shape", "smell" and "taste". Each first-level class is further subdivided into second-level, third-level, and eventually leaf classes. The "category" class, the most important one for the e-commerce field, contains about 800 leaf classes. Classes such as "time", "place", "audience", and "IP" can be integrated with open-domain knowledge graphs. For example, "IP" contains a large number of celebrities, athletes, movies, music, and so on.

Atomic Concepts (Primitive Concepts)
At the atomic concept level, we hope that these fine-grained words can completely describe all user needs, since they are the basis for forming e-commerce concepts. At this level, we mainly discuss two issues:

Mining of Atomic Concept Vocabulary
Construction of Hypernymy Relationship Between Atomic Concepts
Vocabulary mining
After the classification system is defined, there are generally two ways to quickly expand the instances (vocabulary) under each class. The first is to fuse structured data from multiple sources. The usual technique for this is ontology matching; in practice, we mainly use rules plus manual mapping to align structured data from different sources with our taxonomy and merge the vocabularies. The second is to supplement the vocabulary under each class through automatic mining on a large-scale corpus. We frame this as a sequence tagging task and use a BiLSTM+CRF model [1] to discover new words under each class. Since there are too many leaf classes, we use the first-level classes as labels and first mine the vocabulary at coarse granularity.

The figure above is a simple illustration of the BiLSTM+CRF model. The BiLSTM (bidirectional LSTM) layer captures contextual semantic features from both directions of the sentence, while the CRF (Conditional Random Field) layer captures the correlation between the label of the current word and the labels of the preceding and following words. New words that the model identifies as possibly belonging to a class go through manual checks, such as crowdsourced review and outsourced quality inspection, before they are finally stored in the knowledge base and become real atomic concepts. Different atomic concepts may share the same name while belonging to different classes and carrying different meanings. Each atomic concept has an ID, which is also the basis for AliCoCo to disambiguate concepts in the future.
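To make the role of the CRF layer concrete, the sketch below decodes a label sequence with the Viterbi algorithm and then collapses the labels into candidate words. It is a toy illustration under several assumptions: the BIO tagging scheme, the label names, and all scores are invented here, and in a real model the per-token (emission) scores would come from the BiLSTM while the transition scores would be learned by the CRF.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y]     : score of label y for token t (BiLSTM output in a real model)
    transitions[(a, b)] : score of label a being followed by label b (the CRF part)
    """
    # best[y] = best score of any path that ends in label y at the current token
    best = {y: emissions[0][y] for y in labels}
    backpointers: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        new_best, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, y), 0.0))
            new_best[y] = best[prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            pointers[y] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def extract_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse BIO tags into (candidate word, first-level class) pairs."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
            continue
        if current:
            spans.append((" ".join(current), label))
        current, label = ([token], tag[2:]) if tag.startswith("B-") else ([], None)
    if current:
        spans.append((" ".join(current), label))
    return spans


LABELS = ["O", "B-Category", "I-Category", "B-Location"]
tokens = ["charcoal", "grill", "outdoor"]
emissions = [
    {"O": 0.5, "B-Category": 2.0, "I-Category": 0.0, "B-Location": 0.0},
    {"O": 0.1, "B-Category": 1.2, "I-Category": 1.0, "B-Location": 0.0},
    {"O": 0.3, "B-Category": 0.0, "I-Category": 0.0, "B-Location": 2.0},
]
transitions = {
    ("B-Category", "I-Category"): 1.0,   # continuing a Category span is encouraged
    ("O", "I-Category"): -5.0,           # a span cannot start with an I- tag
    ("B-Location", "I-Category"): -5.0,
}

tags = viterbi(emissions, transitions, LABELS)
print(tags)                          # ['B-Category', 'I-Category', 'B-Location']
print(extract_spans(tokens, tags))   # [('charcoal grill', 'Category'), ('outdoor', 'Location')]
```

Taken token by token, the second word would be tagged "B-Category" (1.2 against 1.0). The transition score is what makes the decoder prefer continuing the span, which is the label dependency the CRF layer is meant to capture.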

Hypernymy relationship construction
After mining a certain amount of vocabulary under a first-level class, we need to further assign all the words to classes at different levels. This can be abstracted as hypernym discovery: given a hyponym, find its possible hypernyms in the vocabulary. We combine an unsupervised pattern-based method with a supervised projection-learning method to construct the hypernymy relations.

Pattern based

The pattern-based method [2] is the most intuitive and accurate. By inducing patterns that indicate a hypernymy relation, hypernym-hyponym pairs are extracted directly from sentences. Typical patterns include "XX, a kind of XX" and "XX, including XX". The disadvantage of this method is that the hyponym and hypernym must co-occur in the same sentence, which limits recall. In addition, we exploit some characteristics of Chinese, for example that "XX trousers" must be a kind of "pants", to automatically construct a batch of high-confidence hypernymy relations.
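A rough sketch of both ideas is shown below, using English stand-ins for the patterns. The regular expressions, the `SUFFIX_HYPERNYMS` table and the function names are illustrative assumptions; the actual patterns operate on Chinese text and are not given in this article.

```python
import re
from typing import List, Optional, Tuple

Pair = Tuple[str, str]  # (hyponym, hypernym)

# English stand-ins for the two patterns quoted in the text.
PATTERNS = [
    re.compile(r"^(?P<hypo>.+?), a kind of (?P<hyper>.+?)\.?$"),
    re.compile(r"^(?P<hyper>.+?), including (?P<hypo>.+?)\.?$"),
]

# Head-word rule: a longer term ending in this word is assumed to be its hyponym.
SUFFIX_HYPERNYMS = {"trousers": "pants"}


def extract_by_pattern(sentence: str) -> List[Pair]:
    """Extract pairs only when both words co-occur in a matching sentence."""
    pairs = []
    for pattern in PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            pairs.append((match.group("hypo"), match.group("hyper")))
    return pairs


def extract_by_suffix(term: str) -> Optional[Pair]:
    """Derive a high-confidence pair from the term's head word alone."""
    for suffix, hypernym in SUFFIX_HYPERNYMS.items():
        if term != suffix and term.endswith(suffix):
            return (term, hypernym)
    return None


print(extract_by_pattern("jeans, a kind of pants."))  # [('jeans', 'pants')]
print(extract_by_pattern("pants, including jeans"))   # [('jeans', 'pants')]
print(extract_by_pattern("jeans go well with boots")) # []
print(extract_by_suffix("cotton trousers"))           # ('cotton trousers', 'pants')
```

The third call returns nothing because no pattern matches, which is the recall limitation noted above: a valid pair is found only if some sentence happens to state it in one of the known forms.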

Projection learning

Projection learning works as follows: given a hyponym embedding and a hypernym embedding, a mapping function is learned under supervision so that the projected hyponym embedding is as close as possible to the hypernym embedding. There is much previous work [3, 4] in this direction; some of it first clusters the words and learns a separate mapping for each cluster, with good results. Specifically, we learn a scoring function that represents the strength of the hypernymy relation between a pair of candidate words, and use multiple projection matrices to model features of different dimensions (implicit clustering), with each matrix yielding one score for the pair. Finally, the K scores are passed through a fully connected layer to obtain the final probability. We then train with the cross-entropy loss. The pre-trained word vectors used in the model are trained with word2vec on the aforementioned e-commerce corpus. In addition, we use A La Carte embedding [5] to alleviate the sparsity of some category words in the corpus. The main idea is to learn a mapping matrix and to represent a sparse word by applying that matrix to the sum of the context embeddings around the word.
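One common way to write down the scoring described above is sketched here. The notation is an assumption made for illustration and is not necessarily the authors' exact formulation: x and y are the hyponym and hypernym embeddings, Φ_k is the k-th projection matrix, w and b are the weights of the fully connected layer, t ∈ {0, 1} is the gold label, A is the learned mapping matrix, and C_w is the set of contexts in which the sparse word w occurs. The 1/|C_w| normalization follows the usual A La Carte formulation and is likewise an assumption here.

```latex
\begin{aligned}
s_k &= (\Phi_k \mathbf{x})^{\top} \mathbf{y}, \qquad k = 1, \dots, K \\
p &= \sigma\!\left(\mathbf{w}^{\top} [s_1, \dots, s_K]^{\top} + b\right) \\
\mathcal{L} &= -\left[\, t \log p + (1 - t) \log (1 - p) \,\right] \\
\mathbf{v}_w &= A \cdot \frac{1}{|C_w|} \sum_{c \in C_w} \sum_{w' \in c} \mathbf{v}_{w'}
\end{aligned}
```

Each s_k measures how well the k-th projection of the hyponym aligns with the hypernym, so different matrices can specialize in different kinds of word pairs without an explicit clustering step.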

Active learning

Generating candidates with the model and reviewing them by crowdsourcing proceed in parallel, and the manually reviewed data is continuously fed back to strengthen the model. In this iterative process we therefore use active learning to further improve efficiency and reduce the cost of manual review. We adopted an uncertainty and high confidence sampling strategy (UCS). In addition to the samples the model finds hard to classify as positive or negative (predicted value close to 0.5), we also add a certain proportion of samples with high confidence. The reason is that hypernymy discrimination is easily disturbed by synonymy or relatedness, especially in the early stage, when samples are few, their quality is uneven, and negative sampling is unbalanced, so the model does not distinguish well between mere relatedness and hypernymy. Manually labeling high-confidence samples corrects such misjudgments in time. Experiments show that this strategy helps us reduce labor costs by 35%.
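The selection step can be sketched as below. This is a simplified reading of the strategy, not the production implementation: the function name `ucs_sample`, the `high_conf_ratio` parameter and its value are hypothetical, and "high confidence" is interpreted here as the highest predicted probability of a hypernymy relation.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (candidate pair id, predicted probability of hypernymy)


def ucs_sample(scored: List[Scored], budget: int,
               high_conf_ratio: float = 0.3) -> List[Scored]:
    """Pick `budget` candidates for manual review.

    Most of the budget goes to uncertain predictions (closest to 0.5); the
    rest goes to the most confident predictions, so that confident mistakes
    (e.g. synonyms scored as hypernyms) can be caught and corrected.
    """
    n_confident = int(budget * high_conf_ratio)
    n_uncertain = budget - n_confident

    by_uncertainty = sorted(scored, key=lambda s: abs(s[1] - 0.5))
    uncertain = by_uncertainty[:n_uncertain]

    # Choose confident samples from what is left, so nothing is picked twice.
    remaining = by_uncertainty[n_uncertain:]
    confident = sorted(remaining, key=lambda s: s[1], reverse=True)[:n_confident]
    return uncertain + confident


candidates = [("a", 0.95), ("b", 0.52), ("c", 0.10),
              ("d", 0.45), ("e", 0.99), ("f", 0.70)]
picked = ucs_sample(candidates, budget=4, high_conf_ratio=0.5)
print([name for name, _ in picked])  # ['b', 'd', 'e', 'a']
```

Pure uncertainty sampling would have chosen "b", "d", "f" and "c" here; reserving part of the budget for "e" and "a" is what lets reviewers catch confident misjudgments.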

E-commerce Concepts
In the e-commerce concept layer, each node represents a shopping demand, which can be described by at least one atomic concept. We first introduce the definition of the e-commerce concept, then introduce how the e-commerce concept is mined and generated, and finally introduce the link between the e-commerce concept and the atomic concept.

Definition of e-commerce concept
A valid e-commerce concept must meet the following requirements:

1) Has consumer demand

That is to say, an e-commerce concept must naturally bring a series of products to mind. Phrases such as "blue sky" and "hen laying eggs" do not, so they are not e-commerce concepts.

2) Fluent

That is, an e-commerce concept must read as a fluent phrase. For example, "careful mommy soap" is not an e-commerce concept.

3) Reasonable

That is, an e-commerce concept must conform to human common sense. For example, "European-style Korean-style curtains" and "children's sexy dresses" are not e-commerce concepts, because a curtain cannot be both European-style and Korean-style, and we do not usually use "sexy" to describe a child's dress.

4) Clear target

That is to say, an e-commerce concept must have a clear audience. For example, "children's baby food supplement" is not an e-commerce concept, because food supplements for children and for babies differ greatly, and the phrase would confuse users.

5) No typos

For example, a misspelled phrase such as "Indushen Oil" is not an e-commerce concept.
