Most Asked NLP Interview Questions from fastText

Data Alt Labs
Jun 24, 2022

What is fastText?

FastText is an open-source library developed by the Facebook AI Research (FAIR) lab. Its main focus is on scalable solutions for text classification and text representation that process large datasets quickly and accurately. It is a fast and efficient method for these tasks, and because of the way it is trained, it also ends up learning morphological details of words.

fastText is unique because it can derive word vectors for unknown, out-of-vocabulary words: by taking the morphological characteristics of words into account, it can build a vector for a word it has never seen. Since morphology refers to the internal structure of words, fastText tends to perform better on such syntactic tasks, while word2vec tends to perform better on semantic tasks.

Note: fastText works well with rare words, so even if a word wasn’t seen during training, it can be broken down into character n-grams to get its embedding.
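A minimal sketch of this behaviour using gensim’s FastText implementation (the toy sentences, dimensions, and the unseen word “kingship” are made up for the example): a word that never appeared in training still gets a vector assembled from its character n-grams.

```python
from gensim.models import FastText

# toy corpus; a real model would be trained on a much larger one
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]

model = FastText(sentences=sentences, vector_size=32, window=3,
                 min_count=1, min_n=3, max_n=6, epochs=20)

# "kingship" never appears in the corpus, but fastText still returns a vector,
# built from the character n-grams it shares with words like "king"/"kingdom"
print(model.wv["kingship"][:5])
print(model.wv.similarity("king", "kingdom"))
```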

What are the Uses of fastText?

  1. It is used for finding semantic similarities between words.
  2. It can also be used for text classification (e.g., spam filtering); a minimal sketch follows this list.
  3. It can be trained on large datasets in minutes.
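For use case 2, a sketch with the official fastText Python bindings might look like the following (the file name spam_train.txt and its contents are assumptions; fastText expects one example per line, with labels prefixed by __label__):

```python
import fasttext

# spam_train.txt is assumed to contain lines such as:
#   __label__spam free entry to win a cash prize reply now
#   __label__ham  see you at lunch tomorrow
model = fasttext.train_supervised(input="spam_train.txt",
                                  epoch=25, lr=0.5, wordNgrams=2)

# predict returns the top label(s) and their probabilities
labels, probs = model.predict("congratulations you have won a free prize")
print(labels, probs)
```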

How does fastText work on an unsupervised dataset?

When we have an unlabeled dataset, fastText uses the character n-gram technique to train the model. Let us look at how this technique works in more detail.
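As a starting point, here is roughly what training on an unlabeled corpus looks like with the official Python bindings (corpus.txt is an assumed plain-text file, one document per line):

```python
import fasttext

# skipgram (or "cbow") training on raw, unlabeled text;
# minn/maxn control the character n-gram lengths used for sub-words
model = fasttext.train_unsupervised("corpus.txt", model="skipgram",
                                    minn=3, maxn=6, dim=100)

print(model.get_word_vector("artificial")[:5])   # in-vocabulary word
print(model.get_word_vector("artifically")[:5])  # misspelt/unseen word still gets a vector
```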

FastText represents each word as a bag of character n-grams. For example, take the word “artificial” with n=3: the fastText representation of this word is <ar, art, rti, tif, ifi, fic, ici, cia, ial, al>, where the angular brackets indicate the beginning and end of the word. fastText learns weights for every n-gram along with the entire word token.

In this manner, each token/word is expressed as the sum (or average) of its n-gram component vectors.
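A small plain-Python sketch of this decomposition (the tiny random vectors stand in for learned n-gram weights, just to show the averaging step; a real model also adds a vector for the whole word token):

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking its boundaries."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

ngrams = char_ngrams("artificial")
print(ngrams)  # ['<ar', 'art', 'rti', 'tif', 'ifi', 'fic', 'ici', 'cia', 'ial', 'al>']

# stand-in vectors for each n-gram; in fastText these are learned weights
ngram_vectors = {g: np.random.rand(4) for g in ngrams}

# the word vector is the average (a scaled sum) of its n-gram vectors
word_vector = np.mean([ngram_vectors[g] for g in ngrams], axis=0)
print(word_vector)
```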

  • Word vectors generated through fastText carry extra information about their sub-words. For example, one of the sub-word components of the word “kingdom” is “king”; this information helps the model build semantic similarity between the two words.
  • It also captures the meaning of suffixes and prefixes of the words in the corpus.
  • It generates better word embeddings for rare words as well.
  • It can also generate word embeddings for out-of-vocabulary (OOV) words.
  • With fastText, accuracy is not compromised even if you don’t remove stopwords; you can still apply simple pre-processing steps to your corpus if you feel like it.
  • Because fastText provides sub-word information, it can also be used on morphologically rich languages such as Spanish, French, and German.

FastText also represents a text by a low-dimensional vector, obtained by summing the vectors of the words appearing in the text. In fastText, a low-dimensional vector is associated with each word of the vocabulary. This hidden representation is shared across the classifiers for different categories, allowing information about words learned for one category to be used by other categories. This kind of representation, called bag of words, ignores word order. fastText additionally uses vectors for word n-grams to take local word order into account, which is important for many text classification problems.
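A minimal sketch of this text-level vector with the official bindings (corpus.txt is the same assumed file as above); get_sentence_vector averages the normalised word vectors of the text:

```python
import fasttext

model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)

# one low-dimensional vector for the whole text
vec = model.get_sentence_vector("the quick brown fox jumps over the lazy dog")
print(vec.shape)  # (100,)
```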

fastText is often on par with deep learning classifiers in terms of accuracy, yet many orders of magnitude faster for training and evaluation. With fastText, training times can often be cut from several days to just a few seconds while achieving state-of-the-art performance on many standard problems, such as sentiment analysis or tag prediction.

Note: word2vec and GloVe both fail to provide any vector representation for words that are not in the model’s dictionary. Handling such out-of-vocabulary words is a huge advantage of fastText.

How does fastText work on a supervised dataset?

A softmax function is often used as the output activation in supervised classification problems to give the probability of a given input belonging to each of k classes.

Linear classifier: Here, both texts and labels are represented as vectors. We learn vector representations such that a text and its associated label have similar vectors. In simple words, the vector corresponding to a text is close to the vector of its correct label.

To find the probability score of the correct label given its associated text, we use the softmax function. For example, if “travel” is the label and “car” is the associated text:

P(travel | car) = exp(score(travel, car)) / Σ over all labels of exp(score(label, car))
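A toy sketch of that softmax step (the labels and scores here are invented for illustration):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

# hypothetical scores of the text "car" against each candidate label
labels = ["travel", "food", "sports"]
scores = np.array([2.1, 0.3, -0.5])

probs = softmax(scores)
print(dict(zip(labels, probs)))  # P(travel | "car") gets the largest probability
```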

To maximize this probability of the correct label we can use the gradient descent algorithm. This is computationally expensive because for every piece of text we not only have to compute the score of its correct label but also the score of every other label in the training set. This limits the use of such models on very large datasets with many labels.

Note: FastText solves this problem by using a hierarchical classifier to train the model.
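In the Python bindings this is just a loss setting; a sketch (train.txt is an assumed training file in fastText’s __label__ format):

```python
import fasttext

# "hs" = hierarchical softmax; useful when there are many labels
model = fasttext.train_supervised(input="train.txt", loss="hs")
```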

Hierarchical classifier: Hierarchical softmax proves very efficient when there are a large number of categories and there is class imbalance in the data. Here, the classes are arranged in a tree structure instead of a flat, list-like structure.

The hierarchical softmax layer is built on a Huffman coding tree, which assigns shorter paths to more frequently occurring classes and longer paths to rarer, infrequent classes.

The probability that a given text belongs to a class is explored via a depth-first search along the nodes of the different branches, so branches (equivalently, classes) with low probability can be discarded early.

For data with a huge number of classes, this results in a greatly reduced complexity, speeding up classification significantly compared to a flat softmax over all labels.
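A small sketch of the Huffman idea (the label frequencies are invented): frequent classes end up close to the root, so scoring them requires visiting only a few nodes.

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Depth (code length) of each class in a Huffman tree built from frequencies."""
    tie = itertools.count()  # tie-breaker so heapq never compares the dicts
    heap = [(f, next(tie), {label: 0}) for label, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {k: depth + 1 for k, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# hypothetical label counts from a training set
print(huffman_code_lengths({"travel": 900, "food": 600, "sports": 300, "politics": 50}))
# e.g. {'travel': 1, 'food': 2, 'sports': 3, 'politics': 3} — frequent labels get shorter paths
```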

What are the drawbacks of fastText?

One of the major drawbacks of this model is its high memory requirement, since it builds word embeddings from character n-grams rather than from whole words alone. We can control the number of character embeddings via the maximum and minimum n-gram lengths. For a corpus of around 50 million unique words, the required RAM can be as much as 256 GB. To mitigate this, we can use the min word count hyper-parameter, which can be increased to ignore words below a certain frequency threshold.
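A sketch of those knobs with the official bindings (the file name and the specific values are illustrative; minCount, minn, maxn, bucket, and dim are real fastText options):

```python
import fasttext

model = fasttext.train_unsupervised(
    input="corpus.txt",
    model="skipgram",
    minCount=10,    # drop words seen fewer than 10 times
    minn=3,         # shortest character n-gram
    maxn=5,         # longest character n-gram (fewer lengths -> fewer n-gram vectors)
    bucket=500000,  # size of the shared hash table for all n-grams
    dim=100,        # embedding dimension
)
```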

To further control memory, the n-grams are hashed with a mapping function onto values between 1 and K. The hash used is the Fowler–Noll–Vo (FNV) hashing function. However, as the corpus grows, the number of n-grams mapped to the same bucket also increases. The bucket size represents the total size of the array available for all n-grams; we can increase it to reduce the number of n-grams hashed into the same bucket. The sketch below shows how n-grams might be mapped into buckets when the bucket size is 10.
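A rough sketch of that hashing step (a plain 32-bit FNV-1a, used here only to illustrate the idea; fastText’s actual implementation differs in detail), with a deliberately tiny bucket size of 10 so collisions are visible:

```python
def fnv1a(ngram: str) -> int:
    """32-bit FNV-1a hash of an n-gram string."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) % 2**32
    return h

bucket_size = 10  # tiny on purpose; real models use e.g. 2,000,000 buckets
for g in ["<ar", "art", "rti", "tif", "ifi", "fic", "ici", "cia", "ial", "al>"]:
    print(f"{g} -> bucket {fnv1a(g) % bucket_size}")
```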

Conclusion

FastText learns vectors for the n-grams found within each word, as well as for each complete word. At each training step, the mean of the target word vector and its component n-gram vectors is used for training. The adjustment computed from the error is then applied uniformly to update each of the vectors that were combined to form the target. This adds extra computation to the training step: at each point, a word’s n-gram components need to be summed and averaged. The trade-off is a set of word vectors that contain embedded sub-word information.

References

https://towardsdatascience.com/fasttext-for-text-classification-a4b38cbff27c

https://www.geeksforgeeks.org/fasttext-working-and-implementation/
