How can NLP break through the capability limits of deep learning?

1. Background

In essence, language is a set of logical symbols, which means the input that NLP processes consists of highly abstract, discrete symbols: it skips the process of perception and deals directly with abstract concepts, semantics, and logical reasoning. Precisely because NLP involves complex cognitive capabilities such as high-level semantics, memory, knowledge abstraction, and logical reasoning, deep learning models based on data-driven statistical learning have hit a relatively hard bottleneck in the NLP field. It is no exaggeration to say that the complex cognitive capabilities required by NLP exceed what deep learning can currently deliver. So how do we break this curse, push past the capability boundary of deep learning, and make the key leap from perceptual intelligence to cognitive intelligence? That is what this article explores.

A possible way out is to integrate and distill unstructured data (such as business data, product data, and industry data) into structured business knowledge, structured product knowledge, and structured industry domain knowledge, and then use deep learning models to reason on top of this structured knowledge. This realizes knowledge-driven intelligence, can further advance toward reasoning-driven intelligence, and forms a structured knowledge reasoning engine that improves the cognitive ability of the whole intelligent system. The knowledge graph is the infrastructure for refining and summarizing unstructured data into structured knowledge, and the graph neural network (GNN) is the reasoning model that runs on top of this infrastructure. In one sentence: look at the world with uncertain eyes, then use certain structured knowledge to eliminate that uncertainty.

2. Knowledge graph

Before introducing the knowledge graph, we must first figure out what knowledge is. Knowledge is summarized from a large amount of meaningful data; it is compressed and refined from that data into valuable laws. For example, astronomers observed the positions of the planets day and night, together with the corresponding times. Those are observation data, but Newton discovered the law of universal gravitation from these observations, and that is knowledge. Just as later astronomers used the valuable knowledge of Newton's law of universal gravitation to discover more unknown stars and mysteries of the universe, knowledge will also greatly strengthen the cognitive ability of intelligent systems and enable them to venture deeper into uncharted territory. The knowledge graph is the infrastructure for knowledge storage, representation, extraction, fusion, and reasoning.

Building a knowledge graph system involves six parts: knowledge schema modeling, knowledge extraction, knowledge fusion, knowledge storage, knowledge model mining, and knowledge application:

1. Knowledge schema modeling: build a multi-level knowledge system that defines, organizes, and manages abstract knowledge, attributes, and relation information, and turns them into a concrete knowledge base.

2. Knowledge extraction: transform data of different sources and structures into graph data, covering structured data, semi-structured data parsing, knowledge indexing, knowledge reasoning, and so on, to ensure the validity and integrity of the data.

3. Knowledge fusion: because the knowledge graph draws on a wide range of knowledge sources, there are problems such as uneven knowledge quality, duplicate knowledge from different data sources, and unclear connections between pieces of knowledge, so knowledge fusion must be carried out. Knowledge fusion is a high-level form of knowledge organization that puts knowledge from different sources through steps such as heterogeneous data integration, disambiguation, processing, reasoning verification, and updating under the same framework and specification, achieving the fusion of data, information, methods, experience, and human ideas, and forming a high-quality knowledge base.

4. Knowledge storage: Select an appropriate storage method according to business characteristics and knowledge scale to persist the fused knowledge.

5. Knowledge model mining: learn distributed representations of knowledge, derive new knowledge through knowledge inference with graph-mining algorithms, and mine hidden knowledge with association rules.

6. Knowledge application: provide analysis and application capabilities such as graph retrieval, knowledge computation, and graph visualization for the constructed knowledge graph. It also provides various knowledge computing APIs, including basic graph applications, graph structure analysis, graph semantic applications, natural language processing, graph data acquisition, graph statistics, and so on.

Having listed so many knowledge graph concepts, they may still feel somewhat abstract. Here is an example of an actual knowledge graph in the customs hscode field:
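As a rough, hypothetical illustration of the kind of structure such a graph stores (the product names, declaration elements, and values below are made up; the real schema is described in section 4.2), a fragment of the graph can be thought of as a set of (head, relation, tail) triples:

```python
# hypothetical hscode knowledge-graph fragment expressed as (head, relation, tail) triples;
# all entity names and values are illustrative placeholders
triples = [
    ("lithium battery", "rated capacity", "2000mAh"),
    ("lithium battery", "whether liquid", "non-liquid"),
    ("2000mAh", "industry attribute", "consumer electronics"),
    ("lithium battery", "classified as", "hscode A"),
]
```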

3. Graph neural network GNN

The knowledge graph inductively fuses text, images, time series, and other data that are distributed in Euclidean space, and extracts a graph structure in non-Euclidean space to store structured knowledge. The complexity of graph structures poses a major challenge to traditional deep learning algorithms, mainly because graph-structured data in non-Euclidean space is irregular: the number of nodes varies from graph to graph, and each node has a different number of neighbors, so the convolution operation of traditional deep learning cannot be computed effectively on a graph structure. At the same time, a core assumption of traditional deep learning algorithms is that sample instances are independent of each other; for example, two pictures of cats are completely independent. This is not the case for graph-structured data: the nodes in a graph are organically combined through the connection information of the edges, which naturally builds powerful structural features. In addition, a widely acknowledged weakness of traditional deep learning is that it cannot effectively perform causal reasoning, but only statistical-correlation reasoning in a certain sense, which greatly limits the cognitive ability of intelligent systems. In response to these natural weaknesses of traditional deep learning on graph-structured data and causal inference, a new direction for modeling graph-structured data and causal inference has recently emerged in the industry: the graph neural network (GNN).

3.1 Basic Principles of Graph Convolutional Network GCN

The graph convolutional network GCN is currently the most important graph neural network, and the graph neural network implemented in this article is also based on GCN. GCN is essentially a general framework built on message passing. It is composed of multiple layers of graph convolution operations; each graph convolution layer only processes first-order neighborhood information, and by stacking several graph convolution layers, information can be propagated across multi-order neighborhoods. A message-passing graph neural network is built from the following three basic formulas:
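A standard way to write these three operations, following the common message-passing formulation in the GNN literature, is:

$$a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\left(\left\{ h_u^{(k-1)} : u \in \mathcal{N}(v) \right\}\right)$$

$$h_v^{(k)} = \mathrm{COMBINE}^{(k)}\left( h_v^{(k-1)},\ a_v^{(k)} \right)$$

$$h_G = \mathrm{READOUT}\left(\left\{ h_v^{(K)} : v \in G \right\}\right)$$

where $\mathcal{N}(v)$ is the set of neighbors of node $v$, $h_v^{(k)}$ is the representation of node $v$ after the k-th layer, $K$ is the number of layers, and $h_G$ is the representation of the whole graph.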

The underlying mechanism of almost all GNN models is based on these three formulas; what differs is the implementation strategy chosen for AGGREGATE, COMBINE, and READOUT, and those choices lead to different types of graph neural networks such as GCN, GAT, and GraphSAGE.

3.2 AGGREGATE calculation method of graph convolutional network GCN
In GCN, AGGREGATE works as follows: at each layer, the adjacency matrix A is multiplied with the node feature matrix to obtain, for each vertex, a summary of its neighbors' features; the result is then multiplied by a parameter matrix and passed through the activation function σ for a non-linear transformation, which yields a matrix that aggregates the features of adjacent vertices. The basic formula is as follows:

$$H^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)$$

1. $H^{(l)}$ is the feature matrix at the l-th layer of the GCN, where $H^{(0)} = X$ is the input feature matrix.

2. $W^{(l)}$ is the parameter matrix of each GCN layer.

3. $\tilde{A} = A + I$ is the adjacency matrix of the graph plus the self-loop identity matrix for each graph node.

4. $\tilde{D}$ is the degree matrix of the adjacency matrix $\tilde{A}$.

A minimal code sketch of one such layer is given right after this list.
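The following is a minimal numpy sketch of one GCN propagation layer, assuming a small undirected toy graph (function and variable names are illustrative, not a production implementation):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: sigma(D~^-1/2 (A + I) D~^-1/2 H W), with ReLU as sigma."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    d = A_tilde.sum(axis=1)                       # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))        # D~^-1/2
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(0.0, A_hat @ H @ W)         # aggregate neighbors, transform, activate

# toy example: 3 nodes in a path graph, 4-dim input features, 2-dim output
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H0 = np.random.rand(3, 4)                         # input feature matrix X
W0 = np.random.rand(4, 2)                         # layer parameter matrix
H1 = gcn_layer(A, H0, W0)                         # features aggregated from 1-hop neighbors
```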

The above is the AGGREGATE strategy of the general GCN, but it is a transductive learning method: all nodes must participate in training to obtain their feature representations, and the representation of a new node cannot be obtained quickly. To solve this problem, GraphSAGE in reference [1] learns by sampling a subset of nodes and learns K AGGREGATE aggregation functions. GraphSAGE focuses on the design of the AGGREGATE function, which can be parameter-free, such as max or mean, or a parameterized neural network such as an LSTM. The following figure shows the learning process of the AGGREGATE function in GraphSAGE, from reference [1]:
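A minimal sketch of the sample-and-aggregate idea with a parameter-free mean aggregator (the sampling size and function names are illustrative, not the exact setup of reference [1]):

```python
import random
import numpy as np

def sample_neighbors(adj_list, node, k):
    """Sample up to k neighbors of a node (with replacement), falling back to the node itself."""
    neigh = adj_list[node]
    return random.choices(neigh, k=k) if neigh else [node]

def mean_aggregate(features, adj_list, node, k=5):
    """GraphSAGE-style parameter-free mean aggregation over sampled neighbors."""
    sampled = sample_neighbors(adj_list, node, k)
    return np.mean([features[u] for u in sampled], axis=0)

# toy graph: adjacency list plus random 8-dim node features
adj_list = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
features = {v: np.random.rand(8) for v in adj_list}
agg_vec = mean_aggregate(features, adj_list, node=2)   # aggregated neighbor vector for node 2
```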

3.3 COMBINE Calculation Method of Graph Convolutional Network GCN

In GCN, COMBINE is generally computed by concatenating (CONCAT) the vector that a node obtains at layer k through AGGREGATE with the node vector already learned at layer k-1, and then feeding the concatenated vector into a dense neural network layer. COMBINE in GCN uses concatenation to splice the two original feature vectors directly and lets the network learn the best way to fuse them during training, which ensures that no information is lost in the fusion process.
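A minimal sketch of this COMBINE step, continuing the numpy style above (the dense-layer weights and shapes are illustrative):

```python
import numpy as np

def combine(h_prev, h_agg, W, b):
    """COMBINE: CONCAT the layer k-1 node vector with the AGGREGATE output, then a dense layer."""
    z = np.concatenate([h_prev, h_agg])   # concatenation keeps both original vectors intact
    return np.maximum(0.0, W @ z + b)     # the dense layer learns how to fuse the two parts

h_prev = np.random.rand(8)                # node vector from layer k-1
h_agg = np.random.rand(8)                 # AGGREGATE output at layer k
W, b = np.random.rand(8, 16), np.zeros(8) # dense-layer parameters
h_new = combine(h_prev, h_agg, W, b)      # fused node vector at layer k
```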

3.4 READOUT Calculation Method of Graph Convolutional Network GCN
The graph readout operation (READOUT) generates the representation of the entire graph: the feature vectors of all nodes in the graph are finally aggregated into a single vector that represents the features of the overall graph. GCN readout operations currently fall into two categories: statistics-based methods and learning-based methods.

Statistics-based methods

Statistics-based readout operations generally use sum, max, or average to obtain an abstract representation of the whole graph. The advantage of these statistics is that they are simple and add no extra parameters to the graph neural network model, but the drawbacks of sum, max, and average are also obvious: these operations compress the high-dimensional features, so the distribution of the data along each dimension is largely erased and the information loss is relatively large.
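A minimal sketch of these three statistics-based readouts over a matrix of node embeddings (shapes are illustrative):

```python
import numpy as np

H = np.random.rand(5, 8)          # embeddings of 5 nodes, 8 dimensions each

h_graph_sum = H.sum(axis=0)       # sum readout
h_graph_mean = H.mean(axis=0)     # average readout
h_graph_max = H.max(axis=0)       # max readout
# each readout collapses the node dimension, discarding per-dimension distribution information
```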

Learning-based methods

The disadvantage of statistics-based methods is that they are not parameterized and cause information loss, so they struggle to represent the "complex" process of going from node vectors to a graph vector. Learning-based methods aim to fit this process with a neural network. In reference [2], researchers from Stanford and other universities proposed DIFFPOOL, a differentiable graph pooling module that can be applied to different graph neural networks in a hierarchical, end-to-end manner. DIFFPOOL effectively learns a hierarchical representation of the overall graph. It collapses nodes into soft clusters in a non-uniform way and tends to collapse densely connected subgraphs into clusters. Because GNNs can efficiently pass information over dense, clique-like subgraphs (with small diameters), pooling all nodes of such a dense subgraph is unlikely to lose structural information. In short, DIFFPOOL does not ask every node to produce the graph-level vector in one step; instead, it obtains the final graph representation by gradually compressing information, as shown in the figure below from reference [2]:
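Concretely, the core DIFFPOOL update at pooling layer l, as described in reference [2] (notation written out here for reference), learns a soft cluster-assignment matrix and uses it to coarsen both the node features and the adjacency matrix:

$$Z^{(l)} = \mathrm{GNN}_{l,\mathrm{embed}}\left(A^{(l)}, X^{(l)}\right), \qquad S^{(l)} = \mathrm{softmax}\left( \mathrm{GNN}_{l,\mathrm{pool}}\left(A^{(l)}, X^{(l)}\right) \right)$$

$$X^{(l+1)} = {S^{(l)}}^{\top} Z^{(l)}, \qquad A^{(l+1)} = {S^{(l)}}^{\top} A^{(l)} S^{(l)}$$

where the softmax is taken row-wise and the number of clusters shrinks from layer to layer, so the graph is progressively compressed into a final representation.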

Reference [2] investigates the extent to which DIFFPOOL learns meaningful node clusters by visualizing the cluster assignments at different layers of the graph neural network. The hierarchical cluster-assignment figure in reference [2] visualizes the node assignments at the first and second layers for a graph from the COLLAB dataset, where the color of a node indicates the cluster it belongs to.

4. Implementing a knowledge-graph-based graph neural network for hscode product classification

Hscode is short for the "Harmonized Commodity Description and Coding System", a coding harmonization system formulated by the World Customs Organization (formerly the Customs Co-operation Council); its English name is the Harmonized System Code (HS Code). Hscode is a scientific and systematic international trade commodity classification system used by the customs and commodity entry-exit management agencies of various countries to confirm commodity categories, manage commodity classification, review tariff standards, and inspect commodity quality indicators. The hscode has a total of 22 categories and 98 chapters. The first 6 digits of the code are used internationally, and the subsequent digits are extended by each country or region according to its actual situation; China currently uses 10-digit customs codes. Hscode classification is a very particular NLP scenario. In a general NLP scenario, recalling the right answer in the top 3 based on text semantic matching already indicates decent NLP performance. However, because hscode classification plays a key role in determining the tariff rate and regulatory conditions during customs clearance, it requires a much stricter standard: top-1 accuracy must be guaranteed. This is also the biggest difference from product classification in other business scenarios. A concrete example should make this clearer:

From the above example, we can see that in terms of traditional NLP text semantic similarity, the two texts match with very high similarity, but the correct hscode can only be obtained through detailed reasoning over the specific business knowledge carried by different declaration elements (such as "whether liquid" and "rated capacity" in the figure above).

4.1 Bottleneck of NLP model based on traditional deep learning in hscode classification
In hscode classification, the NLP model architecture diagram based on traditional deep learning is as follows:

The NLP model of traditional deep learning mainly includes the following parts:

1. Hscode particle vector: compute the hscode particle vector using a k-means clustering algorithm over average-pooled word vectors, to improve the hscode vector's expressiveness and robustness to noise (a minimal sketch follows this list).

2. Hscode hierarchical classifier: a two-layer hierarchical classifier first selects candidate hscodes in a coarse-ranking stage.

3. Semantic reasoning in the fine-ranking stage: given the form of domain-specific knowledge in hscode product classification, an encoder-decoder semantic reasoning model based on BiLSTM + Attention was finally selected.
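A minimal sketch of the coarse-ranking vector step from item 1, assuming pre-trained word2vec-style vectors and scikit-learn's KMeans (the library choice and variable names are assumptions, not the exact production setup):

```python
import numpy as np
from sklearn.cluster import KMeans

def sentence_vector(tokens, word_vectors, dim=100):
    """Average pooling over the word vectors of a tokenized product description."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# word_vectors: mapping token -> vector, e.g. loaded from a trained word2vec model
word_vectors = {w: np.random.rand(100) for w in ["lithium", "battery", "liquid", "2000mah"]}

descriptions = [["lithium", "battery", "2000mah"],
                ["lithium", "battery", "liquid"]]
X = np.stack([sentence_vector(d, word_vectors) for d in descriptions])

# cluster the pooled vectors; the cluster centroids act as coarse hscode-level vectors
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_
```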

The NLP model based on traditional deep learning has been running in the online hscode classification scenario. Business reviewers evaluated 2,223 original classification samples submitted by real online customers, measuring the accuracy of the algorithm's top-1 HsCode prediction. The results are as follows:

Unmodified: The business reviewer has not made any changes to the customer's original input information.

Product name not modified: the business reviewer did not modify the product name, but did modify the other classification declaration elements of the product.

Other amendments: this case is more complicated. Since hscode classification is a very complex and professional field, the information originally submitted by the customer did not meet the customs declaration specifications, and the business reviewer had to amend it.

Detailed analysis of the evaluation results: since the traditional deep learning NLP model is trained on standardized samples that have already been reviewed by the business team, on the portion where the customer's original input was not modified its accuracy is 89.3%, almost the same as the accuracy on the algorithm's test set. When the product name was not modified but other classification elements were, the accuracy is 87.7%, which shows that the traditional deep learning NLP model has a certain degree of generalization. But in the third case, other amendments (the customer's original submission does not meet customs specifications), the accuracy is only 17.3%. The accuracy on the overall sample is therefore only 59.3% (0.31 x 89.3% + 0.28 x 87.7% + 0.41 x 17.3% = 59.3%).

The above analysis shows that the accuracy of the traditional deep learning NLP model is very low in the third case (the customer's original submission does not meet the customs declaration specifications). After breaking down the bad cases, the main reasons are as follows:

1. The declaration elements are not missing, but the specific values entered by the customer are not standardized. For example, the customer enters the industry term "Lailing film" (actually a friction film), but this industry domain knowledge does not exist in the NLP corpus, so the traditional deep NLP model cannot handle such cases.

2. The declaration elements are not missing, but some logical calculation over the declaration element values is required. For example, the customer enters "75% cotton, 10% wool, and 15% fiber"; when the cotton content is below 60% the commodity is hscode A, and when the cotton content is above 60% it is hscode B. The traditional deep NLP model cannot handle this kind of NLP problem that requires logical calculation.

3. The customer enters too many declaration elements, which introduces noise; the traditional deep NLP model cannot effectively capture the structural information of the core declaration elements and is easily biased by the redundant noise.

4. The hscode system contains many "other" categories, i.e. fallback codes: except for hscode A, hscode B, and hscode C, all remaining cases fall under hscode D. This presumably also requires a certain amount of reasoning to solve.

5. The declaration elements are missing: the customer did not enter the complete set of hscode declaration elements, which leads to incorrect classification. Presumably no model can handle this case.

Points 1 and 2 above reflect a lack of structured knowledge for model reasoning, and points 3 and 4 reflect an inability to effectively capture structured features for causal and logical reasoning, so we deployed a GCN model based on a knowledge graph to try to solve these problems.

4.2 Knowledge schema modeling of knowledge graph in hscode classification scenario
The metadata schema definition of the knowledge graph is very important. At design time, we must consider not only the relationships between ontologies but also how the ontology schema may change over time. The schema of the hscode product classification knowledge engine is as follows:

4.3 The overall algorithm architecture of the GCN model based on knowledge graph

1. In the hscode domain, the entities of the underlying knowledge graph are the product name, the specific declaration element values, and the industry-domain attribute values corresponding to the declaration elements; the edges in the graph are the keys of the different declaration elements, so the underlying graph is a heterogeneous graph composed of different relation edges. At the same time, because the same product with different declaration element values can lead to different hscodes (labels), the hscode (label) is attached to the subgraph constructed from the product name and the declaration element values, and this subgraph is used as the prediction unit.

2. In the hscode domain, each entity node of the knowledge graph stores the word2vec semantic vector obtained by average pooling, which gives the graph semantic generalization ability; this differs from previous knowledge graphs in which each node mainly stores literal text.

3. In the hscode domain, the structured knowledge stored in the knowledge graph needs to be fused and transformed into the embedding input layer of the GCN model from the semantic features of the node text, the features of the different edge types, and the structural features of the graph.

4. In the field of hscode, the aggregation mechanism of the neighbor nodes of the graph neural network is as follows:
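A rough, hypothetical sketch of the idea described in items 1-4: build a product-centered heterogeneous subgraph whose node features are average-pooled word2vec vectors, then aggregate the neighbors with a GCN step as in section 3.2 (the entities, values, and shapes below are made up for illustration):

```python
import numpy as np

# hypothetical subgraph for one product: node 0 = product name,
# nodes 1-2 = declaration element values, node 3 = an industry attribute value
nodes = ["lithium battery", "rated capacity 2000mAh", "non-liquid", "consumer electronics"]
edges = [(0, 1), (0, 2), (1, 3)]      # edge keys (declaration elements) omitted in this sketch

dim = 100
word_vectors = {w: np.random.rand(dim) for text in nodes for w in text.split()}

def node_feature(text):
    """Average-pooled word2vec-style vector over the tokens of one entity node."""
    return np.mean([word_vectors[w] for w in text.split()], axis=0)

X = np.stack([node_feature(t) for t in nodes])     # node feature matrix (embedding input layer)

A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# one symmetric-normalized GCN aggregation step over the neighbors (see section 3.2)
A_tilde = A + np.eye(len(nodes))
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
W = np.random.rand(dim, 32)
H1 = np.maximum(0.0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ W)

# a READOUT over H1 (e.g. sum) gives the subgraph vector that is then classified into an hscode
h_subgraph = H1.sum(axis=0)
```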

4.5 Results in production
From the detailed analysis in section 4.1, the accuracy of the NLP model based on traditional deep learning on live hscode traffic is only 59.3% (0.31 x 89.3% + 0.28 x 87.7% + 0.41 x 17.3% = 59.3%). The knowledge-graph-based GCN model was trained on all original customer submissions from 2017-01-01 to 2020-01-08, and tested on 5,687 original submissions from real online customers between 2020-01-09 and 2020-01-12. The business evaluation shows that the knowledge-graph-based GCN model reaches an accuracy of 76%, 16.7 percentage points higher than the previous traditional deep learning NLP model, which shows that the knowledge-graph-based GCN model has better fault tolerance.

5. Experiments

At present, two comparison experiments have been set up, with test-set accuracy as the comparison metric. One compares the graph-level READOUT strategies sum and average of the GCN; the other varies the graph structure of the underlying knowledge graph, comparing a simpler star structure with a more complex graph structure.

Comparison of the graph-level READOUT strategies sum and average in the GCN
Keeping the other parameters of the knowledge-graph GCN model unchanged, we change the graph-level READOUT strategy of the GCN from sum to average and observe the accuracy on the test samples:

Comparison of the underlying knowledge graph structure: a simpler star structure versus a more complex graph structure
Keeping the parameters of the knowledge-graph GCN model the same, one underlying knowledge graph storage structure is a simple star structure: only product name-declaration element pairs are connected by edges, and no edges are created between declaration elements. The other storage structure is a more complex graph structure: in addition to the relations between product names and declaration elements, there are also edges between sensitive declaration elements and edges between attribute values and declaration elements.


The experiments show that the richer the structured knowledge in the underlying knowledge graph, the higher the accuracy of the knowledge-graph-based GCN model; changing the READOUT strategy of the graph neural network model also helps improve accuracy. Next, more experiments will be done to fully tap the potential of the knowledge-graph-based GCN model in NLP.

6. Future directions

1. In many business domains, a large amount of business rule knowledge is sorted out manually. How to extract and integrate this heterogeneous rule knowledge into the knowledge graph to further improve its structured reasoning ability, and how to use a graph to store the multi-level logic rules curated by the business, are open questions. Manually curated business rules resemble rule trees that organically combine atomic logical propositions with and, or, and not. How to abstract entities and relations from these multi-level logical rule trees and transform them into a graph structure is also a problem to be tackled in the future.

2. How to effectively integrate rules with the graph neural network. For example, a large number of manual rules have accumulated in the hscode domain; these rules are valuable knowledge. If they are used as a teacher network to guide the student network of the hscode classification task, the accuracy in the hscode domain will be greatly improved. The rule teacher network plays a guiding and constraining role, and the constraint subspace of each rule learned by the teacher network is more conducive to semantic reasoning. How to convert the rules into a teacher network and then combine it with the graph neural network over the knowledge graph is also an important optimization direction.

3. The current knowledge graph is mainly text-based. A truly complete knowledge graph should contain multi-modal structured knowledge; for example, in addition to text, it should also include multi-modal information such as images and speech. Only multi-modal structured knowledge can further advance the cognitive ability of the entire intelligent system.
