What is FastText?
FastText is a free, open-source library from Facebook AI Research (FAIR) for learning word embeddings and for text classification. It supports unsupervised training, which produces vector representations of words, and supervised training, which produces text classifiers. It also includes commands for evaluating these models. For word embeddings, FastText supports both the CBOW and Skip-gram models.
Uses of FastText:
- It is used for finding semantic similarities between words.
- It can also be used for text classification (e.g., spam filtering).
- It can train on large datasets in minutes.
Working of FastText:
FastText trains word vector models very quickly: it can train on about 1 billion words in less than 10 minutes. Models built with deep neural networks can be slow to train and test. FastText instead uses a linear classifier to train the model.
Linear classifier: Here, both the text and the labels are represented as vectors. We look for vector representations such that a text and its associated label have similar vectors. In simple words, the vector corresponding to the text is close to the vector of its correct label.
To find the probability score of the correct label given its associated text, we use the softmax function.
- As an example, take travel as the label and car as the text associated with it.
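For reference, a standard way to write this softmax is sketched below. The symbols are assumptions introduced here for illustration: v_car is the vector of the text, v_l is the vector of a label l, and L is the set of all labels.

```latex
P(\text{travel} \mid \text{car}) =
  \frac{\exp\left(v_{\text{travel}} \cdot v_{\text{car}}\right)}
       {\sum_{l \in L} \exp\left(v_{l} \cdot v_{\text{car}}\right)}
```

The denominator sums over every label in L, so each evaluation touches all the labels.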
To maximize the probability of the correct label, we can use the gradient descent algorithm.
This is quite computationally expensive: for every piece of text, we have to compute the score not only for its correct label but also for every other label in the training set. This limits the use of these models on very large datasets.
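To make the cost concrete, here is a minimal sketch of a full softmax over labels in plain Python. The function name, the toy vectors, and the label names are all made up for illustration; this is not fastText's own code. Note that one dot product is needed per label.

```python
import math

def softmax_probabilities(text_vec, label_vecs):
    """Return P(label | text) for every label using a full softmax."""
    # One dot product per label: the work grows linearly with the label count.
    scores = {label: sum(t * w for t, w in zip(text_vec, vec))
              for label, vec in label_vecs.items()}
    # Subtract the largest score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Toy vectors, invented for illustration only.
text_vec = [0.9, 0.1]
label_vecs = {"travel": [1.0, 0.0], "food": [0.0, 1.0], "sports": [-1.0, 0.0]}

probs = softmax_probabilities(text_vec, label_vecs)
best = max(probs, key=probs.get)
print(best, probs)
```

With three labels this is cheap, but with hundreds of thousands of labels every training example pays for every label.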
FastText solves this problem by using a hierarchical classifier to train the model.
Hierarchical Classifier used by FastText:
In this method, the labels are represented in a binary tree. Every node in the binary tree represents a probability, and a label is represented by the probabilities along the path from the root to that label. This means that the leaf nodes of the binary tree represent the labels.
FastText uses the Huffman algorithm to build these trees, which takes advantage of the fact that classes can be imbalanced: frequently occurring labels sit at a smaller depth than infrequent ones.
Using a binary tree speeds up the search: instead of going through all the labels, you only follow the nodes on one path. We no longer have to compute the score for every possible label; we only calculate the probability at each node on the path to the one correct label. Hence this method vastly reduces the time complexity of training the model.
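The effect of Huffman coding on label depth can be sketched in a few lines of Python. This is only an illustration of the general Huffman construction, not fastText's internal implementation; the function name and the label counts are hypothetical.

```python
import heapq
from itertools import count

def huffman_depths(label_counts):
    """Return the depth of each label's leaf in a Huffman tree."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    # Each heap entry: (total count, tie-breaker, {label: depth so far}).
    heap = [(c, next(tie), {label: 0}) for label, c in label_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Always merge the two least frequent subtrees.
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging pushes every leaf in both subtrees one level deeper.
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, next(tie), merged))
    return heap[0][2]

# Hypothetical label frequencies for an imbalanced dataset.
counts = {"travel": 50, "food": 25, "sports": 15, "politics": 10}
depths = huffman_depths(counts)
print(depths)
```

Here the most frequent label ends up one step below the root, while the two rarest labels sit three steps down, so the common case needs the fewest node evaluations.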
In practice, this increase in speed comes with little loss in accuracy.
- When we have an unlabeled dataset, FastText uses character n-grams to train the model. Let us look in more detail at how this technique works.
Let us consider a word from our dataset, for example “kingdom”. FastText takes the word “kingdom” and breaks it into its n-gram components:
kingdom = ['k','in','kin','king','kingd','kingdo','kingdom',...]
These are some of the n-gram components for the given word. There are many more components for this word; only a few are listed here to give an idea. You can choose the size of the n-gram components: their length lies between a minimum and a maximum number of characters, which you set with the -minn and -maxn flags respectively.
Note: When your text does not consist of words from a particular language, using n-grams won’t make sense. For example, when the corpus contains IDs, it will not be storing words but numbers and special characters. In this case, you can turn off the n-gram embeddings by setting both the -minn and -maxn parameters to 0.
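A simplified sketch of character n-gram extraction is shown below. It assumes the word is wrapped in the boundary markers < and > before slicing, as described in the fastText subword paper, and it uses 3 and 6 as illustrative bounds. The function name is made up, and the library's exact extraction may differ in details.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of `word` with minn <= n <= maxn."""
    if minn <= 0 or maxn <= 0:
        return []  # mirrors switching subword features off with 0 / 0
    token = "<" + word + ">"  # boundary markers (assumption, see lead-in)
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n across the marked-up word.
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams

grams = char_ngrams("kingdom")
print(len(grams), grams[:7])
```

For "kingdom" the list contains pieces such as "kin", "king" and "dom", plus boundary-aware pieces such as "<ki" and "om>".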
When the model updates, fastText learns the weights for every n-gram along with the entire word token.
In this manner, each token/word is expressed as the sum (or average) of the vectors of its n-gram components.
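As a rough sketch of this idea, the snippet below averages the vectors of whichever n-grams of a word are known. Everything here is an assumption for illustration: the function name, the dictionary lookup (the real library hashes n-grams into buckets), and the toy 2-dimensional vectors.

```python
def compose_word_vector(word, ngram_vectors, dim, minn=3, maxn=6):
    """Average the vectors of the word's known character n-grams."""
    token = "<" + word + ">"
    grams = [token[i:i + n]
             for n in range(minn, maxn + 1)
             for i in range(len(token) - n + 1)]
    known = [ngram_vectors[g] for g in grams if g in ngram_vectors]
    if not known:
        return [0.0] * dim  # no known subword at all
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

# Hypothetical n-gram vectors, as if learned from words like "king".
ngram_vectors = {"kin": [1.0, 0.0], "ing": [0.0, 1.0], "king": [1.0, 1.0]}

# "kingly" was never seen as a whole word, yet it still gets a vector.
vec = compose_word_vector("kingly", ngram_vectors, dim=2)
print(vec)
```

Because the vector is built from shared pieces, an unseen word that overlaps with known words still lands near them in the vector space.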
- Word vectors generated through fastText hold extra information about their sub-words. As in the above example, we can see that one of the components for the word “kingdom” is the word “king”. This information helps the model build semantic similarity between the two words.
- It also allows for capturing the meaning of suffixes/prefixes for the given words in the corpus.
- It allows for generating better word embeddings for different or rare words as well.
- It can also generate word embeddings for out-of-vocabulary (OOV) words.
- With fastText, accuracy generally holds up even if you don’t remove the stopwords. You can still perform simple pre-processing steps on your corpus if you feel like it.
- As fastText has the feature of providing sub-word information, it can also be used on morphologically rich languages like Spanish, French, German, etc.
We do get better word embeddings through fastText, but it uses more memory than word2vec or GloVe because it generates a lot of sub-words for each word.
Implementation of FastText
First, we have to build fastText. To do so, follow the steps given below.
In your terminal, run the following commands:
$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make
After this, add the path of the folder containing the built fasttext binary to your system PATH variable. You can then invoke fasttext directly from the terminal.
We have successfully built fastText.
The commands supported by fastText are:
  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  test-label              print labels with precision and recall scores
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  print-ngrams            print ngrams given a trained model and word
  nn                      query for nearest neighbors
  analogies               query for analogies
  dump                    dump arguments,dictionary,input/output vectors
For this walkthrough, I have taken the Amazon reviews dataset and saved it as amazon_reviews.txt. You can also perform some pre-processing on your data to get better results.
We will be training a skipgram model. From the fastText-0.9.2 directory, run the following command:
$ fasttext skipgram -input amazon_reviews.txt -output model_trained
Here the input file is amazon_reviews.txt. Make sure to give the full path to your file if it is not in the same directory. model_trained is the name given to the output files.
You can also set other parameters explicitly as per your requirements, such as the number of epochs (-epoch). Here we have used the defaults.
FastText first reads the words present in the input document. In this run, the document consisted of 32M words and the ETA was around 15 minutes.
While training, it prints detailed statistics: the learning rate, the number of words processed per second per thread, and the loss value, which keeps decreasing as the model is trained.
After the model is trained, two files are generated: model_trained.bin and model_trained.vec. The .bin file contains the parameters of the model along with the dictionary. This is the file which fasttext uses. The .vec file is a text file which contains the word vectors. This is the file which you will be using in your applications.
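To use the .vec file in your own application, you can parse it directly. The sketch below assumes the usual textual layout: a header line with the vocabulary size and the vector dimension, followed by one line per word holding the word and its space-separated values. The function name and the tiny in-memory sample are made up for illustration.

```python
import io

def load_vec(stream):
    """Parse word vectors from a text stream in the .vec layout."""
    header = stream.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

# For a real file you would pass an open file handle instead, e.g.
#   with open("model_trained.vec", encoding="utf-8") as f: load_vec(f)
sample = io.StringIO("2 3\nking 0.1 0.2 0.3 \nqueen 0.2 0.1 0.4 \n")
vectors, dim = load_vec(sample)
print(dim, sorted(vectors))
```

The result is a plain dictionary from word to list of floats, which is enough for the similarity operations that follow.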
We are now going to use our word vectors and perform some operations on them:
1) Finding Nearest Neighbors for a given word
To initialize the nearest neighbor interface execute the following command:
$ fasttext nn model_trained.bin
The interface prompts for a query word and prints the nearest neighbors of that word. Here, the query word was “brutality”.
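Conceptually, a nearest-neighbor query ranks every other word by how similar its vector is to the query vector. The sketch below assumes cosine similarity as the measure and uses invented toy vectors; the function names are hypothetical and this is not the library's own code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(query, vectors, k=3):
    """Return the k (similarity, word) pairs closest to `query`."""
    q = vectors[query]
    scored = [(cosine(q, vec), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scored, reverse=True)[:k]

# Toy vectors, invented for illustration only.
vectors = {"king": [0.9, 0.8], "queen": [0.8, 0.9], "apple": [-0.7, 0.2]}
neighbors = nearest_neighbors("king", vectors, k=2)
print(neighbors)
```

On real vectors loaded from the .vec file, the same ranking gives a list of related words ordered by similarity.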
2) Performing Word Analogies
To perform word analogies of the form (A - B + C), you can execute the following command:
$ fasttext analogies model_trained.bin
The word analogies for A = king, B = man, C = woman are:
The first output for the query is “queen”, which is the expected answer for this query. This suggests that the trained model captures this relationship well.
You can also perform other operations like testing your model with a file of test data, making predictions of the correct labels, getting the n-grams for given words, etc. You can do these by using the above-mentioned commands available in fasttext.
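The arithmetic behind such a query can be sketched as follows: build the target vector A - B + C, then pick the remaining word whose vector is most similar to it. The toy vectors and the function name are assumptions for illustration, and the three query words are excluded from the candidates in this sketch.

```python
import math

def analogy(a, b, c, vectors):
    """Return the word whose vector is most similar to A - B + C."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]

    def cosine(u, v):
        dot = sum(p * q for p, q in zip(u, v))
        norm = (math.sqrt(sum(p * p for p in u))
                * math.sqrt(sum(q * q for q in v)))
        return dot / norm if norm else 0.0

    # Skip the query words themselves so they cannot be returned.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Toy 2-dimensional vectors, invented for illustration only.
vectors = {
    "king": [0.9, 0.1], "man": [0.5, 0.1], "woman": [0.5, 0.9],
    "queen": [0.9, 0.9], "apple": [-0.8, -0.3],
}
answer = analogy("king", "man", "woman", vectors)
print(answer)
```

With these toy values the target vector lands on "queen"; on real embeddings the match is only approximate, which is why the command returns a ranked list.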