How to embrace embedding?

Vector Semantics

Word semantics can be subdivided into different aspects from a linguistic point of view:

Synonyms: couch/sofa, car/automobile

Antonyms: long/short, big/little

Word Similarity: cat and dog are not synonyms, but they are similar.

Word Relatedness: Two words can be related without being similar. coffee and cup are not similar, but they are clearly related.

Semantic field (the idea behind topic models such as LDA): Some words belong to the same semantic domain and are strongly correlated with each other, e.g. restaurant (waiter, menu, plate, food, chef) and house (door, roof, kitchen, family, bed).

Semantic Frames and Roles: some words describe different participant roles or perspectives of the same event; buy, sell, and pay correspond to different roles in a purchase.

Hypernym/hyponym: A word that denotes the broader category of another word is its hypernym, and the more specific word is its hyponym, e.g. vehicle/car, mammal/dog, fruit/mango.

Connotations, emotions, sentiment, opinions: positive and negative emotional words happy/sad, positive and negative evaluation words (great, love)/(terrible, hate).

An ideal vector representation would capture all of the semantic aspects listed above, which is obviously unrealistic. So far the most successful model for representing word semantics is Vector Semantics, which is the basis of word embedding technology. Vector Semantics consists of two parts:

Distributional hypothesis (the theoretical basis of all semantic vectors): words that occur in similar contexts tend to have similar meanings, so a word can be defined by its distribution in texts.

Vector representation: the meaning of a word w is defined as a vector, a point in N-dimensional semantic space, which is learned directly from the word's distribution in texts.

Representing word semantics as vectors makes it easy to calculate word similarity.
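
A minimal sketch of how vector similarity can be calculated, using cosine similarity. The three-dimensional vectors below are made up purely for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical word vectors, for illustration only.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
cup = [0.1, 0.2, 0.9]

sim_cat_dog = cosine_similarity(cat, dog)  # similar words: close to 1
sim_cat_cup = cosine_similarity(cat, cup)  # dissimilar words: noticeably lower
```

With these toy values, cat and dog score higher than cat and cup, which is the behaviour a useful semantic space should have.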

Co-occurrence Matrix and Basic Vector Semantics models

Vector Semantics models are usually built on a co-occurrence matrix, which records how often elements occur together and is the concrete implementation of the distributional hypothesis mentioned above. Depending on the elements, there are two main co-occurrence matrices: the term-document matrix (used mostly for retrieval) and the word-word matrix (used mostly for word embedding).

■ Term-document matrix and TF-IDF model

Term-document matrix: Each row is a word in the vocabulary, each column is an article in the collection, and each value is the number of times the word (row) appears in the article (column). An article can be represented as a column vector, and two articles that contain similar words are generally similar (all sports docs will have similar entries).

The term-document matrix is widely used in information retrieval to find similar articles, but the basic matrix has a problem: when raw counts are used directly, uninformative words that appear very frequently (stopwords) receive a large weight, so the TF-IDF model needs to be introduced.

TF-IDF (term frequency-inverse document frequency) is a basic weighting technique in NLP and the mainstream weighting technique for co-occurrence matrices in information retrieval. It describes the importance of a word to an article. The main idea is that if a word appears with a high frequency (TF) in one article and rarely appears in other articles, it is considered to discriminate well between categories. The formula is as follows:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}_t}$$

where tf(t, d) is the (often log-scaled) frequency of term t in article d, N is the number of articles, and df_t is the number of articles that contain t.

The term-document matrix can therefore be converted into a TF-IDF weighted matrix, which suppresses uninformative words that appear in almost every article (such as good).

Practical note: to calculate the TF-IDF value of every word in the corpus, the intuitive approach is to first build the term-document matrix by counting and then convert it into the TF-IDF weighted matrix.
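
The two steps in the practical note can be sketched as follows. This is a minimal illustration, not a reference implementation: whitespace tokenization, the log-scaled term frequency log10(count + 1), and the three toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, counts) where counts[i][j] is how often vocab[i] occurs in docs[j]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    per_doc = [Counter(toks) for toks in tokenized]
    counts = [[c[word] for c in per_doc] for word in vocab]
    return vocab, counts

def tfidf_matrix(counts):
    """Weight raw counts with tf = log10(count + 1) and idf = log10(N / df)."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)  # number of documents containing the word
        idf = math.log10(n_docs / df)
        weighted.append([math.log10(c + 1) * idf for c in row])
    return weighted

# Toy corpus, for illustration only.
docs = [
    "the team won the match",
    "the match was a good match",
    "the chef cooked a good meal",
]
vocab, counts = term_document_matrix(docs)
weights = tfidf_matrix(counts)
tfidf = dict(zip(vocab, weights))  # word -> list of per-document weights
```

In this toy corpus the word "the" occurs in every document, so its idf is zero and it is suppressed completely, while "chef", which occurs in a single document, gets the highest idf.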

■ Word-word matrix and PMI model

Word-word matrix: Similar to the term-document matrix, except that each column is a context word, and each cell is the number of times the target word (row) and the context word (column) co-occur in the corpus. The context is usually a window around the target word. The word-word matrix is often used to compute count-based word embeddings, which capture syntactic/POS information with a small context window and semantic information with a large context window.
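
A minimal sketch of counting word-word co-occurrences with a symmetric window. The nested-dictionary layout, the whitespace tokenizer, and the two toy sentences are assumptions chosen for brevity.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each context word appears within `window` tokens of a target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[target][tokens[j]] += 1
    return counts

# Toy corpus, for illustration only.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
]
counts = cooccurrence_counts(corpus, window=1)
```

Each row `counts[target]` is the sparse count vector of one target word; a larger `window` adds more distant context words to the same row.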

Pointwise Mutual Information (PMI) is one of the most important basic concepts in NLP. It describes whether two words are strongly associated by measuring how much more often they co-occur in the corpus than we would a priori expect them to by chance:

$$\mathrm{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}$$

Positive PMI (PPMI), which replaces negative values with zero, is often used in practice:

$$\mathrm{PPMI}(w, c) = \max\left(\mathrm{PMI}(w, c),\ 0\right)$$

The word-word co-occurrence matrix can be rewritten as a PPMI matrix in two steps.

The first step is to convert the co-occurrence counts $f_{ij}$ into joint probabilities:

$$p_{ij} = \frac{f_{ij}}{\sum_{i}\sum_{j} f_{ij}}, \qquad p_{i*} = \sum_{j} p_{ij}, \qquad p_{*j} = \sum_{i} p_{ij}$$

The second step is to calculate the PPMI value. The PPMI matrix describes the association between words well:

$$\mathrm{PPMI}_{ij} = \max\left(\log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}},\ 0\right)$$

The basic PPMI formula has a problem: it gives very large PMI values to low-frequency words (the smaller the denominator, the larger the value). A common approach is to adjust the calculation of p(c) to narrow the probability gap between low-frequency and high-frequency words, with α=0.75:

$$P_{\alpha}(c) = \frac{\mathrm{count}(c)^{\alpha}}{\sum_{c'} \mathrm{count}(c')^{\alpha}}$$

Related thinking: For a set of numbers, softmax widens the gap by making the larger values relatively larger, whereas raising the counts to the power α before normalizing p(c) does the opposite and narrows the gap between large and small values. The negative sampling technique described below uses the same trick.

The vectors generated from the word-word co-occurrence matrix have several problems: they are high-dimensional and sparse, and models built on them are not robust. A simple remedy is to apply Singular Value Decomposition (SVD) to the co-occurrence matrix to obtain low-dimensional dense vectors.

Practical note: to calculate the PMI value of each word pair in the corpus, the intuitive approach is to first build the word-word matrix by counting, then convert it into a joint probability matrix, and finally calculate the PPMI matrix. If the rows are to be used as word vectors, SVD can be applied first to reduce the dimensionality.
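
The counting-to-PPMI pipeline can be sketched as below. The function name `ppmi_matrix`, the nested-dictionary input, and the toy counts are assumptions for illustration; the SVD step is left out to keep the sketch dependency-free.

```python
import math

def ppmi_matrix(counts, alpha=0.75):
    """Convert counts[w][c] into PPMI values, smoothing the context probability with alpha."""
    total = sum(n for row in counts.values() for n in row.values())
    word_totals = {w: sum(row.values()) for w, row in counts.items()}
    context_totals = {}
    for row in counts.values():
        for c, n in row.items():
            context_totals[c] = context_totals.get(c, 0) + n
    # P_alpha(c) = count(c)^alpha / sum over c' of count(c')^alpha
    smoothed_norm = sum(n ** alpha for n in context_totals.values())
    p_context = {c: (n ** alpha) / smoothed_norm for c, n in context_totals.items()}

    ppmi = {}
    for w, row in counts.items():
        p_w = word_totals[w] / total
        ppmi[w] = {}
        for c, n in row.items():
            if n == 0:
                continue  # unseen pairs stay at zero
            pmi = math.log2((n / total) / (p_w * p_context[c]))
            ppmi[w][c] = max(pmi, 0.0)  # clip negative PMI to zero
    return ppmi

# Hypothetical co-occurrence counts (target word -> context word -> count).
counts = {
    "coffee": {"cup": 8, "the": 10},
    "tea": {"cup": 6, "the": 12},
    "car": {"road": 7, "the": 14},
}
ppmi = ppmi_matrix(counts)                       # with alpha = 0.75 smoothing
ppmi_unsmoothed = ppmi_matrix(counts, alpha=1.0)  # plain PPMI
```

With `alpha=1.0` the function reduces to plain PPMI. In this toy example the rare context "road" gets a lower score once smoothing is switched on, which is the intended effect of α.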

Word Embeddings

Word vectors based on TF-IDF and PPMI are long and sparse, which is not conducive to storage and calculation. Now commonly used word vectors are short and dense Word Embeddings:

short (length 50-1000)
dense (most elements are non-zero)

Word Embeddings rest on the same theoretical basis as Vector Semantics, the distributional hypothesis: a word's meaning is given by the words that frequently appear close by.

The goal of word embedding is to encode similarity in the vectors themselves: all optimization objectives and practical uses revolve around similarity.

Two terms are clarified here:

Distributional representations: refers to the idea of the distributional hypothesis, where meaning is derived from the contexts a word appears in. The opposite approach, as in WordNet, constructs relationships directly between individual words.

Distributed representations: words are represented by vectors, and the meaning of a word is spread across all dimensions. The opposite is the one-hot vector, in which only a single dimension is non-zero.


The most classic word embedding method, Word2vec (Mikolov et al. 2013), is a framework for calculating word vectors. The main idea is:

Prepare a very large text corpus;
Represent each word in the vocabulary as a fixed-length vector;
Traverse every position t in the corpus, where the center word is c and the context words are o;
Calculate the conditional probability of o given c based on word vector similarity;
Constantly adjust the word vectors to maximize this conditional probability.

■ Objective Function

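The objective function of Word2vec is the average negative log-likelihood over all positions t = 1, ..., T with window size m:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

The training goal is to minimize the objective function. The conditional probability is obtained by converting dot-product similarities (not cosine similarities) into a probability distribution with softmax:

$$P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$

Here o and c are indices in the vocabulary, $v_c$ is the center vector of c, $u_o$ is the context vector of o, and the probability of predicting o given c is the prediction function of Word2vec.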
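
A minimal sketch of the prediction function: a softmax over the dot products between one center vector and the context vectors of the whole vocabulary. The two-dimensional vectors and the three-word vocabulary are hypothetical values chosen for illustration.

```python
import math

def context_probabilities(center_vec, context_vecs):
    """Softmax over the dot products u_w . v_c for every word w in the vocabulary."""
    scores = {w: sum(a * b for a, b in zip(u, center_vec)) for w, u in context_vecs.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    z = sum(exps.values())            # the normalization factor
    return {w: e / z for w, e in exps.items()}

# Hypothetical vectors, for illustration only.
v_center = [1.0, 0.5]       # v_c, the center vector
u_context = {               # u_w, one context vector per vocabulary word
    "milk": [1.2, 0.4],
    "water": [0.9, 0.6],
    "road": [-1.0, 0.1],
}
probs = context_probabilities(v_center, u_context)
loss_for_milk = -math.log(probs["milk"])  # negative log-likelihood when o = "milk"
```

The normalization factor `z` sums over the entire vocabulary, which is why this exact softmax becomes expensive for realistic vocabulary sizes.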

■ Train Word2vec

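The trainable parameters θ of the Word2vec model are the vectors of all V words in the vocabulary: each word w has a d-dimensional center vector $v_w$ and a d-dimensional context vector $u_w$, so there are 2dV parameters in total.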

Training Word2vec requires computing the gradients of all vectors.

To calculate the partial derivative, first decompose the original formula into 2 parts.

Therefore, in combination, the partial derivative of the context probability of word o and word c to the center vector can be obtained:

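It can be seen that the gradient is the difference between the observed context vector and the expected context vector under the model's current probabilities, so optimization pushes the expected context vector toward the observed one.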
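
The two-part decomposition described above can be written out as follows. This is a sketch in the usual Word2vec notation, where the symbol names are assumptions based on the standard presentation: $v_c$ is the center vector, $u_w$ is the context vector of word w, and V is the vocabulary.

```latex
\begin{aligned}
\frac{\partial}{\partial v_c} \log P(o \mid c)
  &= \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)} \\
  &= \underbrace{\frac{\partial}{\partial v_c}\, u_o^{\top} v_c}_{\text{part 1}}
     \;-\; \underbrace{\frac{\partial}{\partial v_c} \log \sum_{w \in V} \exp(u_w^{\top} v_c)}_{\text{part 2}} \\
  &= u_o \;-\; \sum_{x \in V} \frac{\exp(u_x^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}\, u_x \\
  &= u_o \;-\; \sum_{x \in V} P(x \mid c)\, u_x
\end{aligned}
```

Part 1 is the observed context vector and part 2 is the expectation of the context vectors under the current model, so the gradient vanishes when the two agree.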

Some suggestions for practical training:

Each word corresponds to two vectors for the convenience of optimization, and finally the average of the two vectors is used as the final word vector;

The original paper has two models:

Skip-gram (sg): uses the center word to predict the surrounding words; sg handles rare words better than cbow.

Continuous Bag of Words (cbow): uses the surrounding words to predict the center word; cbow trains faster than sg.

Computing the normalization factor in the denominator of softmax is very time-consuming. The paper instead uses negative sampling: a logistic regression is trained as a binary classifier to distinguish real (center word, context word) pairs from fake (center word, random non-context word) pairs constructed by random sampling. The objective to maximize is:

$$J_t(\theta) = \log \sigma(u_o^{\top} v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} \left[ \log \sigma(-u_j^{\top} v_c) \right]$$

Rewritten in the same negative form as before, as a loss to minimize:

$$J_{\text{neg-sample}}(o, v_c, U) = -\log \sigma(u_o^{\top} v_c) - \sum_{i=1}^{k} \log \sigma(-u_i^{\top} v_c)$$

The k negative sampled words follow the probability distribution $P(w) = U(w)^{3/4} / Z$, where U(w) is the unigram distribution and Z is the normalization term. The purpose of modifying the word frequency distribution is to increase the chance that low-frequency words are sampled, which is similar to the smoothing technique in the PMI section.

Practical note: to train Word2vec, the intuitive approach is to first prepare all word pairs from the corpus, including real pairs (c, o) and negatively sampled pairs (c, random), then compute the loss function from the dot-product similarity of the vectors, and then optimize the parameters with SGD. It is best to average the center and context vectors as the final embedding.
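
The training recipe in the practical note can be sketched for a single (center, context) pair as below. This is a simplified illustration, not the original implementation: the function names, the learning rate, the word counts, and the two-dimensional starting vectors are all assumptions.

```python
import math
import random

def noise_distribution(word_counts, power=0.75):
    """Unigram counts raised to `power` and renormalized (Z is the normalizer)."""
    z = sum(n ** power for n in word_counts.values())
    return {w: (n ** power) / z for w, n in word_counts.items()}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_step(v_c, u_o, u_negs, lr=0.1):
    """One SGD step on -log s(u_o.v_c) - sum_k log s(-u_k.v_c); returns the loss before the update."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    loss -= sum(math.log(sigmoid(-dot(u_k, v_c))) for u_k in u_negs)

    g_pos = sigmoid(dot(u_o, v_c)) - 1.0                  # d loss / d (u_o . v_c)
    g_negs = [sigmoid(dot(u_k, v_c)) for u_k in u_negs]   # d loss / d (u_k . v_c)

    # Gradient w.r.t. the center vector, computed before any vector is changed.
    grad_v = [g_pos * a for a in u_o]
    for g, u_k in zip(g_negs, u_negs):
        grad_v = [gv + g * a for gv, a in zip(grad_v, u_k)]

    # Update the context vectors using the old v_c, then v_c itself (all in place).
    for i in range(len(u_o)):
        u_o[i] -= lr * g_pos * v_c[i]
    for g, u_k in zip(g_negs, u_negs):
        for i in range(len(u_k)):
            u_k[i] -= lr * g * v_c[i]
    for i in range(len(v_c)):
        v_c[i] -= lr * grad_v[i]
    return loss

# Hypothetical corpus counts and starting vectors, for illustration only.
word_counts = {"the": 100, "cat": 10, "milk": 5, "road": 1}
noise = noise_distribution(word_counts)
unsmoothed = noise_distribution(word_counts, power=1.0)

u_ctx = {"the": [0.2, 0.2], "milk": [0.3, 0.1], "road": [-0.1, 0.2]}
v_cat = [0.1, -0.2]  # center vector of "cat"

# Draw k = 2 negatives for the real pair (cat, milk), excluding the pair's own words.
candidates = ["the", "road"]
rng = random.Random(0)
negatives = rng.choices(candidates, weights=[noise[w] for w in candidates], k=2)

losses = [sgns_step(v_cat, u_ctx["milk"], [u_ctx[w] for w in negatives]) for _ in range(50)]
```

Repeating the step on the same pair makes the loss shrink, because the real pair's dot product is pushed up and the sampled pairs' dot products are pushed down. A full trainer would loop over all pairs in the corpus and resample the negatives each time.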

Other Improved Embeddings

■ GloVe: Count-based plus Prediction-based

The calculation method of Word embedding can be divided into count-based (PMI matrix) and prediction-based (Word2vec).

Both have their own advantages. Count-based makes full use of global statistical information and trains faster, while prediction-based is more capable of handling massive amounts of data and can capture more patterns other than word similarity.

In fact, the two are closely related, since both rest on the distributional hypothesis. The product of the center word matrix W and the context word matrix C of the Skip-gram model implicitly factorizes a PMI matrix shifted by a constant that depends on the number of negative samples k (Levy and Goldberg 2014):

$$W C^{\top} = M^{\mathrm{PMI}} - \log k$$

The GloVe vector proposed by Stanford combines the advantages of count-based and prediction-based methods. The core idea is that the ratio of co-occurrence conditional probability can more significantly reflect the similarity of words:

The ratio of co-occurrence conditional probabilities can be expressed as a linear representation of the word vector space:

The final loss function is:
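
For reference, the relations above can be sketched in the notation commonly used for GloVe. The symbol names are assumptions taken from the standard presentation: $X_{ij}$ is the co-occurrence count of words i and j, $w$ and $\tilde{w}$ are the word and context vectors, $b$ and $\tilde{b}$ are bias terms, and f is a weighting function that caps the influence of very frequent pairs.

```latex
% Ratio of co-occurrence probabilities as a linear relation in vector space
w_i \cdot w_j = \log P(i \mid j)
\quad\Longrightarrow\quad
w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}

% Weighted least-squares loss over the co-occurrence counts
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The GloVe paper reports $x_{\max} = 100$ and $\alpha = 3/4$ as the values used in its experiments.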

Related thinking: When items are compared with each other, the absolute values often carry little meaning, while the ratio between them is statistically more informative. The core idea of GloVe is based on this, and so is NBSVM, a very strong baseline that improves on TF-IDF features by using a ratio-based measure of how important a word is to a category.

■ FastText: Sub-word Embeddings

The previous embedding methods ignore the internal structure (morphology) of words: each word has an independent vector, so they cannot deal with OOV words. FastText is an improved version of Skip-gram that uses sub-word information: each word is represented as a bag of character n-grams, and the final word vector is the sum of the n-gram embeddings.

For example, with n=3 the word where is represented by the character n-grams <wh, whe, her, ere, re>, plus the special sequence <where> for the word itself. The symbols < and > mark the beginning and end of the word and are used to distinguish the position of a subsequence. In practice all n-grams with n from 3 to 6 are generally used. Note that the word <her> is different from the tri-gram her taken from where. The advantage of FastText is that training is very fast and the results are generally better than Word2vec. It can also compute vectors for OOV words, so it is a good first-choice word vector when starting a project.
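
A minimal sketch of the n-gram extraction and of summing n-gram vectors into a word vector. The function names and the tiny `gram_vectors` table are hypothetical; the real FastText implementation additionally hashes n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in < and >, plus the whole wrapped word."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    if wrapped not in grams:
        grams.append(wrapped)  # special sequence for the word itself
    return grams

def word_vector(word, gram_vectors, dim):
    """Sum the vectors of all known n-grams; unknown n-grams are skipped."""
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for i, x in enumerate(gram_vectors.get(gram, [])):
            vec[i] += x
    return vec

trigrams = char_ngrams("where", n_min=3, n_max=3)

# Hypothetical n-gram embeddings, for illustration only.
gram_vectors = {"<wh": [1.0, 0.0], "her": [0.0, 2.0]}
vec_where = word_vector("where", gram_vectors, dim=2)
vec_whale = word_vector("whale", gram_vectors, dim=2)  # unseen word, still gets a vector
```

Because "whale" shares the n-gram `<wh` with "where", it receives a non-zero vector even though it never appeared as a whole word, which is how OOV words are handled.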

Evaluating Embeddings

■ Intrinsic vs extrinsic

Word vector evaluation is divided into intrinsic evaluation and extrinsic evaluation. Extrinsic evaluation plugs the vectors into various downstream tasks to see the actual effect; the vectors can be used as fixed pre-trained embeddings, fine-tuned pre-trained embeddings, multichannel inputs, or concatenated features (see the TextCNN paper for details). There are several types of intrinsic evaluation:

(1) Analogy: evaluate intuitive semantic analogies by finding the x that satisfies "a is to b as x is to y".

(2) Visualization: project the vectors to two dimensions and inspect them visually; commonly used techniques are PCA and t-SNE (a non-linear projection).

Comparison of the effect of pca and t-SNE:

(3) Categorization: Cluster the trained embeddings, according to the standard classification set or manual verification, to see whether each cluster category is good or bad.
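
The analogy test in (1) can be sketched with plain vector arithmetic: the candidate x is the word whose vector is closest to a - b + y. The two-dimensional vectors below are hypothetical and constructed so that the analogy holds exactly.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, y, vectors):
    """Solve "a is to b as x is to y": return the word x closest to a - b + y."""
    target = [va - vb + vy for va, vb, vy in zip(vectors[a], vectors[b], vectors[y])]
    candidates = [w for w in vectors if w not in (a, b, y)]  # never return an input word
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical vectors: the first axis loosely encodes "royalty", the second "gender".
vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [-1.0, 2.0],
}
x = analogy("man", "woman", "queen", vectors)  # man is to woman as x is to queen
```

With these toy vectors the result is "king". With real embeddings the match is only approximate, and accuracy over a benchmark list of analogies is what gets reported.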

■ When are Pre-trained Embeddings Useful?

Simply put, it is more useful when the training corpus for specific tasks is relatively scarce, as follows:

Very useful: tagging, parsing, text classification
Less useful: machine translation
Basically useless: language modeling

Sentence Embeddings/Encoder

Sentence Embeddings and Word Embeddings are very similar, and the training methods can be roughly divided into bag-of-words models based on pure statistics and neural network models based on a sentence-level distributional hypothesis. Sentence representations are used for:

Sentence Classification: feature input for text classification;
Paraphrase Identification: judging whether two passages are paraphrases of each other;
Semantic Similarity: judging whether the semantics of two sentences are similar;
Natural Language Inference: Entailment/Contradiction/Neutral;
Retrieval: similarity retrieval based on sentence vectors.

Baseline Bow Models

The baseline model for sentence embedding can be a statistical bag-of-words model similar to the TF-IDF weighting matrix mentioned above, or a bag-of-words model based on word vectors. The weighting of the word vectors generally follows one rule: the more common the word, the smaller the weight. Here is a brief introduction to one such model, SIF (Smooth Inverse Frequency), a simple but effective weighted bag-of-words model whose performance exceeds that of simple RNN/CNN models. The calculation of SIF is divided into two steps:

Multiply each word vector in the sentence by the weight a/(a+p(w)) and average the results, where a is a constant (usually 0.0001) and p(w) is the frequency of the word in the global corpus; the higher the frequency, the smaller the weight.
Calculate the first principal component u of the sentence vector matrix, and subtract from each sentence vector its projection on u (similar to PCA).

The complete algorithm flow is as follows:

SIF removes high-frequency words and syntactic structure that have little to do with sentence meaning, and retains the components that contribute most to the semantics of the sentence.
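
The two SIF steps can be sketched in pure Python as below. Several simplifications are assumptions of this sketch: the first principal direction is estimated by power iteration on the uncentered sentence matrix, and the word vectors, word probabilities, and sentences are toy values.

```python
import math

def sif_embeddings(sentences, word_vectors, word_prob, a=1e-4, iters=100):
    """Weighted average of word vectors, then removal of the first principal direction."""
    dim = len(next(iter(word_vectors.values())))
    sent_vecs = []
    for tokens in sentences:
        vec = [0.0] * dim
        for tok in tokens:
            weight = a / (a + word_prob[tok])  # frequent words get small weights
            for i, x in enumerate(word_vectors[tok]):
                vec[i] += weight * x
        sent_vecs.append([x / len(tokens) for x in vec])

    # Power iteration for the dominant direction u of the sentence-vector matrix X.
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(s[i] * u[i] for i in range(dim)) for s in sent_vecs]              # X u
        new_u = [sum(p * s[i] for p, s in zip(proj, sent_vecs)) for i in range(dim)]  # X^T X u
        norm = math.sqrt(sum(x * x for x in new_u))
        if norm == 0.0:
            break
        u = [x / norm for x in new_u]

    # Subtract each sentence vector's projection on u.
    result = []
    for s in sent_vecs:
        p = sum(s[i] * u[i] for i in range(dim))
        result.append([s[i] - p * u[i] for i in range(dim)])
    return result, u

# Hypothetical word vectors and corpus frequencies, for illustration only.
word_vectors = {
    "the": [1.0, 1.0, 0.0], "cat": [0.0, 1.0, 1.0], "dog": [0.0, 0.9, 1.1],
    "sleeps": [1.0, 0.0, 1.0], "car": [1.0, 0.2, 0.0], "stops": [0.5, 0.0, 0.3],
}
word_prob = {"the": 0.05, "cat": 0.001, "dog": 0.001, "sleeps": 0.0005, "car": 0.001, "stops": 0.0005}
sentences = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["the", "car", "stops"]]

embeddings, u = sif_embeddings(sentences, word_vectors, word_prob)
```

After the second step every sentence vector is orthogonal to u, so the direction shared by all sentences no longer dominates similarity scores.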


Skip-Thought Vector uses an idea similar to Word2vec/language models, with the sentence embedding as a by-product of the model. A training example is a triple $(s_{i-1}, s_i, s_{i+1})$ of 3 consecutive sentences. The model uses the Encoder-Decoder framework: during training the Encoder encodes $s_i$, and two Decoders then generate the previous sentence $s_{i-1}$ and the next sentence $s_{i+1}$, as shown in the figure below:

The Decoder is a language model conditioned on the sentence representation: at each step it outputs the probability of the next word given the encoded sentence and the words generated so far.


Quick-Thought is an upgraded version of Skip-Thought. The task of generating the previous and next sentence from a given sentence is reformulated as a classification task: the Decoder is replaced by a classifier that selects the correct previous/next sentence from a set of candidate sentences. Skip-Thought can be understood as a generative model, whereas Quick-Thought is a classification model.

Skip-Thought is trained to reconstruct the surface structure of the target sentence (the specific combination of words), so the model learns not only the semantics of the sentence but also details of its wording that have nothing to do with semantics. The same meaning can be expressed in many ways, sometimes with completely different surface structures, and Skip-Thought tends to treat such sentences as dissimilar.

Quick-Thought defines the loss function directly in the semantic space after sentence vectorization, which is better than Skip-Thought's loss defined in the raw data space, and the optimization goal is more focused. The loss function is similar to that of the Skip-gram model with negative sampling: the similarity between sentence vectors is defined directly by a simple vector similarity (the inner product in the original paper) and normalized with softmax, except that it is a multi-class classification over the candidate sentences.
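
A minimal sketch of such a loss, assuming the inner product as the scoring function and using the other sentences of the batch as candidates. The function name, the `neighbors` parameter, and the toy encodings are assumptions; in the real model the encodings come from two trained encoders.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quick_thought_loss(encoded_f, encoded_g, neighbors=1):
    """Average -log P(context sentence | sentence) with in-batch candidates.

    encoded_f[i] and encoded_g[i] are the two encodings of the i-th sentence
    in a batch of consecutive sentences.
    """
    n = len(encoded_f)
    total, terms = 0.0, 0
    for i in range(n):
        # Score every other sentence in the batch as a candidate context.
        scores = {j: dot(encoded_f[i], encoded_g[j]) for j in range(n) if j != i}
        max_s = max(scores.values())
        log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores.values()))
        for j, s in scores.items():
            if abs(i - j) <= neighbors:  # the true previous/next sentences
                total += -(s - log_z)    # -log softmax probability of candidate j
                terms += 1
    return total / terms

# Hypothetical encodings: sentences 0/1 and 2/3 are pairwise similar.
ordered = [[2.0, 0.0], [2.0, 0.2], [0.0, 2.0], [0.2, 2.0]]
shuffled = [ordered[0], ordered[2], ordered[1], ordered[3]]

loss_ordered = quick_thought_loss(ordered, ordered)
loss_shuffled = quick_thought_loss(shuffled, shuffled)
```

The loss is lower when adjacent sentences have similar encodings, which is the signal the encoders are trained on. For brevity the same encodings are passed for both encoders here.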

The top-level classifier is kept simple so that the underlying encoder is forced to learn more semantics-related representations and to encode similarity in the vectors themselves. Quick-Thought is therefore a general framework for learning sentence vectors. The GRU used as the encoder in the original paper can be replaced with other architectures, such as a Transformer, to increase representational capacity. At prediction time, the outputs of the two encoders are concatenated as the final sentence vector representation.

In our actual work, Quick-Thought is commonly used, also because it trains much faster than Skip-Thought Vector, which needs to train 3 RNN modules. Some specific implementation details:

batch_size=400, that is, a batch is 400 consecutive sentences;
context_size=3, that is, for a given sentence, the previous sentence and the next sentence are considered similar (the sentence-level distributional hypothesis);
Negative sampling: within a batch, all sentences other than the context sentences are used as negative examples. Experiments show that this simple strategy performs comparably to other common random negative sampling strategies;
Word vectors or the entire encoder can be pre-trained to speed up training.

Other Sentence Embeddings

The unsupervised general-purpose sentence vector training methods mentioned above are convenient to implement in industry. Academia has also proposed sentence vector models based on supervised learning (InferSent) and multi-task learning (GenSen). Their performance can be better, but the training data is not so easy to obtain, so in practice they are generally applied through transfer learning. Here is a brief introduction to some of the more classic ones.

■ InferSent (Facebook)

InferSent uses a supervised approach to train Sentence Embedding on Natural Language Inference (NLI) datasets. The paper also demonstrates that the sentence vectors trained on the NLI dataset are also suitable for transfer to other NLP tasks.

In the NLI task, each sample consists of three elements (u, v, l), where u is the premise, v is the hypothesis, and l is the class label (entailment 1, contradiction 2, neutral 3). The roles of u and v within a sample are not interchangeable. The network structure of InferSent is as follows:

The sentence encoder can be chosen according to specific needs, and it is shared between the two inputs: the premise and the hypothesis are encoded into the sentence representations u and v respectively. Three methods that are very common in matching models are then used to extract the relationship between the two: concatenation (u, v), element-wise product u∗v, and absolute element-wise difference |u−v|. Finally, a three-class classifier predicts the label l.
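
The feature combination can be sketched as a single function. The name `nli_features` and the example vectors are hypothetical, and the order of the concatenated parts is an arbitrary choice of this sketch.

```python
def nli_features(u, v):
    """Concatenate (u, v, u * v, |u - v|) as input to the three-class classifier."""
    product = [a * b for a, b in zip(u, v)]
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + product + abs_diff

# Hypothetical sentence representations of a premise and a hypothesis.
u = [0.5, -1.0, 2.0]
v = [0.5, 1.0, -1.0]
features = nli_features(u, v)
```

For d-dimensional sentence vectors the classifier input has 4d dimensions. The product and the absolute difference are symmetric in u and v, while the concatenation preserves which sentence is the premise.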

■ GenSen (Microsoft)

The core idea of GenSen is that in order to be able to generalize to various tasks, multiple aspects of the same sentence need to be encoded. Simply put, the model is trained on multiple tasks and multiple data sources at the same time, but shares the same Sentence Embedding. Tasks and datasets include:

Skip-Thought (predict previous/next sentence) - BookCorpus
Neural Machine Translation (NMT) - En-Fr (WMT14) + En-De (WMT15)
Natural Language Inference (NLI) - SNLI + MultiNLI
Constituency Parsing - PTB + 1-billion word

The basic model is similar to Skip-Thought Vector: the Encoder uses a Bi-GRU for speed, and the Decoder is exactly the same.

SemAxis: Synonym Expansion

After pre-trained word vectors or sentence vectors have been obtained, a direct application is to find synonyms/synonymous sentences through vector similarity. Here is a simple and effective synonym expansion method, SemAxis (ACL 2018):

Prepare pre-trained word vectors;
In the word vector space, compute the semantic axis vector from the seed word vectors: $V_{axis} = V^{+} - V^{-}$, where $V^{+}$ and $V^{-}$ are the average vectors of the positive and negative seed words;

In the word vector space, project the vectors of other words onto the semantic axis, scoring each word by the cosine similarity between its vector and $V_{axis}$.
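
The steps above can be sketched as follows. The function names and the two-dimensional embeddings are hypothetical; real usage would load pre-trained vectors and a larger seed list.

```python
import math

def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_axis(pos_seeds, neg_seeds, embeddings):
    """Axis vector = mean of the positive seed vectors minus mean of the negative seed vectors."""
    v_pos = mean_vector([embeddings[w] for w in pos_seeds])
    v_neg = mean_vector([embeddings[w] for w in neg_seeds])
    return [p - n for p, n in zip(v_pos, v_neg)]

def axis_scores(axis, embeddings, exclude=()):
    """Cosine similarity of every non-seed word with the axis (near +1 positive, near -1 negative)."""
    return {w: cosine(vec, axis) for w, vec in embeddings.items() if w not in exclude}

# Hypothetical embeddings, for illustration only.
embeddings = {
    "good": [1.0, 0.2], "great": [0.9, 0.1],
    "bad": [-1.0, 0.1], "terrible": [-0.9, 0.2],
    "wonderful": [0.8, 0.3], "awful": [-0.8, 0.3], "table": [0.0, 1.0],
}
pos_seeds, neg_seeds = ["good", "great"], ["bad", "terrible"]
axis = semantic_axis(pos_seeds, neg_seeds, embeddings)
scores = axis_scores(axis, embeddings, exclude=pos_seeds + neg_seeds)
```

Words with scores near +1 or -1 are candidates for expanding the positive or negative word list, while neutral words such as "table" score near zero.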

In our actual work, the effect of expanding emotional words is as follows, the circled ones are seed words, and the others are expanded.
