Most Asked NLP Interview Questions on GloVe

Data Alt Labs
Jun 23, 2022

What is GloVe?

GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm developed by researchers at Stanford University that generates word embeddings by aggregating global word-word co-occurrence statistics from a given corpus.
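
As a quick illustrative sketch (not part of GloVe's training itself), the snippet below loads pre-trained GloVe vectors from one of the released text files and compares words by cosine similarity. The file name glove.6B.50d.txt is an assumption here; substitute whichever release you have downloaded.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file: each line is a word followed by its vector components."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Assumed file name from a standard pre-trained release; adjust to your local copy.
glove = load_glove("glove.6B.50d.txt")
print(cosine(glove["king"], glove["queen"]))    # semantically related words score high
print(cosine(glove["king"], glove["cabbage"]))  # unrelated words score lower
```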

The basic idea behind the GloVe word embedding is to derive the relationships between words from corpus statistics. Unlike a simple occurrence count, the co-occurrence matrix tells you how often a particular word pair occurs together: each entry Xᵢⱼ counts how many times word j appears in the context (for example, within a fixed-size window) of word i.
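
To make the co-occurrence matrix concrete, here is a minimal sketch that counts co-occurrences over a toy two-sentence corpus with a symmetric window of size 2 (both the corpus and the window size are illustrative assumptions; the actual GloVe implementation also down-weights a pair's count by the distance between the two words).

```python
from collections import defaultdict

# Toy corpus; real GloVe models are trained on billions of tokens.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
window = 2  # context window size (an assumption for this example)

cooc = defaultdict(float)  # (word, context_word) -> count
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooc[(word, tokens[j])] += 1.0

print(cooc[("sat", "on")])   # how often "on" appears near "sat"
print(cooc[("cat", "dog")])  # 0.0 -> these never co-occur in this toy corpus
```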

How does GloVe work?

GloVe uses a different mechanism and set of equations from predictive models such as word2vec to create the embedding matrix. To study GloVe, let’s define the following terms first.
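
Following the notation of the original GloVe paper:

```
X_{ij} : \text{number of times word } j \text{ occurs in the context of word } i
X_i = \sum_k X_{ik} : \text{number of times any word appears in the context of word } i
P_{ik} = P(k \mid i) = X_{ik} / X_i : \text{probability that word } k \text{ appears in the context of word } i
```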

And the ratio of co-occurrence probabilities is as follows:
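
(with wᵢ and wⱼ as the two words being compared and wₖ as the probe word)

```
\text{ratio} = \frac{P_{ik}}{P_{jk}}
```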

This ratio gives us some insight into the correlation of the probe word wₖ with the words wᵢ and wⱼ.

Given a probe word, the ratio can be small, large, or close to 1, depending on the correlations. For example, if the ratio is large, the probe word is related to wᵢ but not to wⱼ. In the GloVe paper’s example with wᵢ = ice and wⱼ = steam, the probe word solid gives a large ratio, gas gives a small one, and water or fashion (related to both or to neither) give ratios close to 1. This ratio therefore gives us hints on the relations between three different words. Intuitively, this is somewhere between a bi-gram and a 3-gram.

Now, we want to develop a model F that captures the desirable behavior we want for the embedding vectors w. As discussed before, linearity is important in the word embedding concept (e.g., king − man + woman ≈ queen), so if a system is trained on this principle, we should expect that F can be reformulated as:
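
(following the paper’s derivation, where w̃ₖ denotes the separate context-word embedding of the probe word)

```
F\big((w_i - w_j)^{\top}\tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}
```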

where the parameters of F involve only the difference of the two word embeddings and their similarity (dot product) with the probe word’s context embedding w̃ₖ.

In addition, the relation between a word and a context word is symmetrical (i.e., relation(a, b) = relation(b, a)). To enforce such symmetry, we can require
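
(in the paper’s notation)

```
F\big((w_i - w_j)^{\top}\tilde{w}_k\big) = \frac{F(w_i^{\top}\tilde{w}_k)}{F(w_j^{\top}\tilde{w}_k)}
```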

Intuitively, we are maintaining the linear relationship among all these embedding vectors.

To fulfill this relation, F must be the exponential function, i.e. F(x) = exp(x). Combining the last two equations, we get
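
(following the paper)

```
F(w_i^{\top}\tilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i}
```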

Since F(x) = exp(x),
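
(taking the logarithm of both sides of the previous equation)

```
w_i^{\top}\tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i
```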

We can absorb log(Xᵢ) into a constant bias term bᵢ since it does not depend on k. But to maintain the symmetry requirement between i and k, we split it into two bias terms, bᵢ and b̃ₖ, as shown below. These vectors w and biases b form the embedding matrices. Therefore, the dot product of two embedding vectors (plus their biases) predicts the log of the co-occurrence count.
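
In the paper’s notation, the resulting relation is:

```
w_i^{\top}\tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}
```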

What cost function does GloVe use?

To begin, cross-entropy error (the measure used by skip-gram-style models) is just one among many possible distance measures between probability distributions, and it has the unfortunate property that distributions with long tails are often modeled poorly, with too much weight given to the unlikely events. Furthermore, for the measure to be bounded, it requires that the model distribution Q be properly normalized. This presents a computational bottleneck owing to the sum over the whole vocabulary, so it would be desirable to consider a different distance measure that does not require this normalization of Q. A natural choice is a least-squares objective in which the normalization factors in Q and P are discarded.
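
Following the paper’s notation, that least-squares objective can be written as:

```
\hat{J} = \sum_{i,j} X_i \left(\hat{P}_{ij} - \hat{Q}_{ij}\right)^2
```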

where P̂ᵢⱼ = Xᵢⱼ and Q̂ᵢⱼ = exp(wᵢᵀw̃ⱼ) are the unnormalized distributions. At this stage, another problem emerges, namely that Xᵢⱼ often takes very large values, which can complicate the optimization.

An effective remedy is to minimize the squared error of the logarithms of Pˆ and Qˆ instead,
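
(in the paper’s notation; the second equality uses log Q̂ᵢⱼ = wᵢᵀw̃ⱼ and log P̂ᵢⱼ = log Xᵢⱼ)

```
\hat{J} = \sum_{i,j} X_i \left(\log\hat{P}_{ij} - \log\hat{Q}_{ij}\right)^2
        = \sum_{i,j} X_i \left(w_i^{\top}\tilde{w}_j - \log X_{ij}\right)^2
```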

Finally, we observe that while the weighting factor Xᵢ is preordained by the online training method inherent to the skip-gram and ivLBL models, it is by no means guaranteed to be optimal.

In fact, Mikolov et al. (2013a) observe that performance can be increased by filtering the data so as to reduce the effective value of the weighting factor for frequent words. With this in mind, we introduce a more general weighting function, which we are free to take to depend on the context word as well.

The result is,
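
GloVe’s final weighted least-squares cost, shown here with the bias terms; xₘₐₓ = 100 and α = 3/4 are the values the paper reports working well:

```
J = \sum_{i,j=1}^{V} f(X_{ij}) \left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
```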

Conclusion

Vector representation techniques such as GloVe can be used to represent the words of a given corpus with semantic meaning. Additionally, we have seen the main working idea behind GloVe, the co-occurrence matrix, and how GloVe uses ratios of co-occurrence probabilities to judge how strongly a probe word is related to other words.

References

https://jonathan-hui.medium.com/nlp-word-embedding-glove-5e7f523999f6

https://nlp.stanford.edu/pubs/glove.pdf
