
TF-IDF for Bigrams & Trigrams

Last Updated : 27 Sep, 2019

TF-IDF in NLP stands for Term Frequency – Inverse Document Frequency. It is a very popular technique in Natural Language Processing, the field that deals with processing human languages. In any text-processing task, cleaning the text (preprocessing) is vital. The cleaned text then needs to be converted into a numerical format in which each word or phrase is represented by a vector; this is also known as word embedding.
Term Frequency (TF) = (frequency of the term in the document) / (total number of terms in the document)
Inverse Document Frequency (IDF) = log((total number of documents) / (number of documents containing the term t))
TF-IDF = TF × IDF
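
As a quick sanity check of these formulas, here is a minimal hand computation on a hypothetical two-document toy corpus (not taken from the article). Note that scikit-learn's TfidfVectorizer, used later in this article, applies a smoothed IDF and L2 normalisation by default, so its scores will not match this raw formula exactly.

import math

# A hypothetical two-document toy corpus, just to plug into the formulas
docs = [
    "the boy is playing football",
    "the boy likes football"
]
term = "playing"

# TF: frequency of the term in a document / total number of terms in that document
tokens = docs[0].split()
tf = tokens.count(term) / len(tokens)            # 1 / 5 = 0.2

# IDF: log(total number of documents / number of documents containing the term)
n_containing = sum(1 for d in docs if term in d.split())
idf = math.log(len(docs) / n_containing)         # log(2 / 1) ≈ 0.693

print("TF-IDF of '%s' in doc 0: %.4f" % (term, tf * idf))   # ≈ 0.1386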

Bigrams: A bigram is a sequence of 2 consecutive words in a sentence. E.g. for “The boy is playing football”, the bigrams are:

The boy
boy is
is playing
playing football

Trigrams: A trigram is a sequence of 3 consecutive words in a sentence. For the same example, the trigrams are:

The boy is
boy is playing
is playing football
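
As a quick check, the same bigrams and trigrams can be listed with NLTK's ngrams helper. This is a minimal sketch and assumes the NLTK tokenizer data has been downloaded (for example via nltk.download('punkt')).

from nltk import ngrams
from nltk.tokenize import word_tokenize

sentence = "The boy is playing football"
tokens = word_tokenize(sentence)

print(list(ngrams(tokens, 2)))   # the four bigrams listed above
print(list(ngrams(tokens, 3)))   # the three trigrams listed above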

Of the bigrams and trigrams generated above, some are relevant for further processing, while others that add no value are discarded.
Say we want to find, from a document, the skills required to be a “Data Scientist”. If we consider only unigrams, single words cannot convey the details properly. For a phrase like ‘Machine learning developer’, the term we want to extract is ‘Machine learning’ or ‘Machine learning developer’; the isolated words ‘Machine’, ‘learning’ or ‘developer’ will not give the expected result.
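
A small sketch of this point, using a hypothetical one-line corpus: with unigrams the vocabulary splits the phrase into isolated words, while a bigram/trigram vocabulary keeps ‘machine learning’ and ‘machine learning developer’ intact. (get_feature_names_out requires scikit-learn 1.0 or newer; older versions use get_feature_names.)

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["machine learning developer"]   # hypothetical one-line corpus

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(corpus)
print(unigrams.get_feature_names_out())
# ['developer' 'learning' 'machine']  -- the phrase is lost

phrases = CountVectorizer(ngram_range=(2, 3)).fit(corpus)
print(phrases.get_feature_names_out())
# ['learning developer' 'machine learning' 'machine learning developer']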

Code – Extracting trigrams and their TF-IDF scores




# Importing libraries
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Input the file
with open('C:\\Users\\DELL\\Desktop\\MachineLearning1.txt') as file:
    txt1 = file.readlines()

# Preprocessing
def remove_string_special_characters(s):

    # Remove special characters
    stripped = re.sub(r'[^a-zA-Z\s]', '', s)
    stripped = re.sub('_', '', stripped)

    # Collapse any run of whitespace into one space
    stripped = re.sub(r'\s+', ' ', stripped)

    # Remove leading and trailing whitespace
    stripped = stripped.strip()
    if stripped != '':
        return stripped.lower()

# Stopword removal
stop_words = set(stopwords.words('english'))
your_list = ['skills', 'ability', 'job', 'description']
for i, line in enumerate(txt1):
    txt1[i] = ' '.join([x for x in word_tokenize(line)
                        if (x not in stop_words) and (x not in your_list)])

# Getting trigrams
vectorizer = CountVectorizer(ngram_range=(3, 3))
X1 = vectorizer.fit_transform(txt1)
features = vectorizer.get_feature_names_out()   # use get_feature_names() on scikit-learn < 1.0
print("\n\nFeatures : \n", features)
print("\n\nX1 : \n", X1.toarray())

# Applying TFIDF
vectorizer = TfidfVectorizer(ngram_range=(3, 3))
X2 = vectorizer.fit_transform(txt1)
scores = X2.toarray()
print("\n\nScores : \n", scores)

# Getting top ranking features
# (both vectorizers build the same trigram vocabulary, so 'features' lines up with the columns of X2)
sums = X2.sum(axis=0)   # total TF-IDF score of each trigram over all lines
data1 = []
for col, term in enumerate(features):
    data1.append((term, sums[0, col]))
ranking = pd.DataFrame(data1, columns=['term', 'rank'])
words = ranking.sort_values('rank', ascending=False)
print("\n\nWords head : \n", words.head(7))


Output:

Features : 
 ['10 experience working', '11 exposure implementing', 'able work minimal',
 'accounts commerce added', 'analysis recognition face', 'analytics contextual image',
 'analytics nlp ensemble', 'applying data science', 'bagging boosting text',
 'beyond existing learn', 'boosting text analytics', 'building using logistics',
 'building using supervised', 'classification facial expression', 
 'classifier deep learning', 'commerce added advantage',
 'complex engineering analysis', 'contextual image processing',
 'creative projects work', 'data science problem', 'data science solutions',
 'decisions report progress', 'deep learning analytics', 'deep learning framework',
 'deep learning neural', 'demonstrated development role', 'demonstrated leadership role',
 'description machine learning', 'detection tracking classification',
 'development role machine', 'direction project less', 'domains essential position',
 'domains like healthcare', 'ensemble classifier deep', 'existing learn quickly',
 'experience object detection', 'experience working multiple',
 'experienced technical personnel', 'expertise visualizing manipulating',
 'exposure implementing data', 'expression analysis recognition',
 'extensively worked python', 'face iris finger', 'facial expression analysis',
 'finance accounts commerce', 'forest bagging boosting', 'framework tensorflow keras',
 'good oral written', 'guidance direction project', 'guidance make decisions',
 'healthcare finance accounts', 'implementing data science', 'including provide guidance',
 'innovative creative projects', 'iris finger gesture', 'job description machine',
 'keras or pytorch', 'leadership role projects', 'learn quickly new',
 'learning analytics contextual', 'learning framework tensorflow',
 'learning neural networks', 'learning projects including', 'less experienced technical',
 'like healthcare finance', 'linear regression svm', 'logistics regression linear',
 'machine learning developer', 'machine learning projects', 'make decisions report',
 'manipulating big datasets', 'minimal guidance make', 'model building using',
 'motivated able work', 'multiple domains like', 'must self motivated',
 'new domains essential', 'nlp ensemble classifier', 'object detection tracking',
 'oral written communication', 'perform complex engineering', 'problem solving proven',
 'problem statements bring', 'proficiency deep learning', 'proficiency problem solving',
 'project less experienced', 'projects including provide', 'projects work spare',
 'proven perform complex', 'proven record working', 'provide guidance direction',
 'quickly new domains', 'random forest bagging', 'recognition face iris',
 'record working innovative', 'regression linear regression', 'regression svm random',
 'role machine learning', 'role projects including', 'science problem statements',
 'science solutions production', 'self motivated able', 'solutions production environments',
 'solving proven perform', 'spare time plus', 'statements bring insights',
 'supervised unsupervised algorithms', 'svm random forest', 'tensorflow keras or',
 'text analytics nlp', 'tracking classification facial', 'using logistics regression',
 'using supervised unsupervised', 'visualizing manipulating big', 'work minimal guidance',
 'work spare time', 'working innovative creative', 'working multiple domains']


X1 : 
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Scores : 
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


Words head : 
                             term      rank
41     extensively worked python  1.000000
79    oral written communication  0.707107
47             good oral written  0.707107
72          model building using  0.673502
27  description machine learning  0.577350
70     manipulating big datasets  0.577350
67    machine learning developer  0.577350

Now, if we do the same for bigrams, the initial part of the code remains unchanged; only the n-gram range passed to the vectorizers changes.
Code : Python code for implementing bigrams




# Getting bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))
X1 = vectorizer.fit_transform(txt1)
features = vectorizer.get_feature_names_out()   # use get_feature_names() on scikit-learn < 1.0
print("\n\nX1 : \n", X1.toarray())

# Applying TFIDF to the bigrams
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X2 = vectorizer.fit_transform(txt1)
scores = X2.toarray()
print("\n\nScores : \n", scores)

# Getting top ranking features
sums = X2.sum(axis=0)   # total TF-IDF score of each bigram over all lines
data1 = []
for col, term in enumerate(features):
    data1.append((term, sums[0, col]))
ranking = pd.DataFrame(data1, columns=['term', 'rank'])
words = ranking.sort_values('rank', ascending=False)
print("\n\nWords : \n", words.head(7))


Output:

X1 : 
 [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Scores : 
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


Words : 
                     term      rank
50   great interpersonal  1.000000
110     skills abilities  1.000000
23         deep learning  0.904954
72      machine learning  0.723725
21          data science  0.723724
128        worked python  0.707107
42    extensively worked  0.707107

Likewise, we can obtain TF-IDF scores for bigrams or trigrams as the use case demands. Choosing the right n-grams can give a better result without any extra processing of the data.
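
If both levels of detail are useful, the two runs can also be combined in a single pass by setting ngram_range = (1, 3), so that TfidfVectorizer scores unigrams, bigrams and trigrams together. The sketch below uses a hypothetical toy corpus in place of the article's job-description file.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Hypothetical toy corpus standing in for the preprocessed job-description lines
corpus = [
    "machine learning developer with deep learning experience",
    "data science solutions using machine learning"
]

vectorizer = TfidfVectorizer(ngram_range=(1, 3))   # unigrams, bigrams and trigrams together
X = vectorizer.fit_transform(corpus)

# Rank every n-gram by its total TF-IDF score across the corpus
sums = X.sum(axis=0).A1
ranking = pd.DataFrame({'term': vectorizer.get_feature_names_out(), 'rank': sums})
print(ranking.sort_values('rank', ascending=False).head(7))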


