An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap like data structure that directs you from a word to a document or a web page.
Creating Inverted Index
We will create a Word level inverted index, that is it will return the list of lines in which the word is present. We will also create a dictionary in which key values represent the words present in the file and the value of a dictionary will be represented by the list containing line numbers in which they are present. To create a file in Jupiter notebook use magic function:
%%writefile file.txt This is the first word. This is the second text, Hello! How are you? This is the third, this is it now.
This will create a file named file.txt will the following content.
To read file:
Python3
# this will open the file file = open ( 'file.txt' , encoding = 'utf8' ) read = file .read() file .seek( 0 ) read # to obtain the # number of lines # in file line = 1 for word in read: if word = = '\n' : line + = 1 print ( "Number of lines in file is: " , line) # create a list to # store each line as # an element of list array = [] for i in range (line): array.append( file .readline()) array |
Output:
Number of lines in file is: 3 ['This is the first word.\n', 'This is the second text, Hello! How are you?\n', 'This is the third, this is it now.']
Functions used:
- Open: It is used to open the file.
- read: This function is used to read the content of the file.
- seek(0): It returns the cursor to the beginning of the file.
Remove punctuation:
Python3
punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~''' for ele in read: if ele in punc: read = read.replace(ele, " " ) read # to maintain uniformity read = read.lower() read |
Output:
'this is the first word \n this is the second text hello how are you \n this is the third this is it now '
Clean data by removing stopwords:
Stop words are those words that have no emotions associated with it and can safely be ignored without sacrificing the meaning of the sentence.
Python3
from nltk.tokenize import word_tokenize import nltk from nltk.corpus import stopwords nltk.download( 'stopwords' ) for i in range ( 1 ): # this will convert # the word into tokens text_tokens = word_tokenize(read) tokens_without_sw = [ word for word in text_tokens if not word in stopwords.words()] print (tokens_without_sw) |
Output:
['first', 'word', 'second', 'text', 'hello', 'third']
Create an inverted index:
Python3
dict = {} for i in range (line): check = array[i].lower() for item in tokens_without_sw: if item in check: if item not in dict : dict [item] = [] if item in dict : dict [item].append(i + 1 ) dict |
Output:
{'first': [1], 'word': [1], 'second': [2], 'text': [2], 'hello': [2], 'third': [3]}
Attention geek! Strengthen your foundations with the Python Programming Foundation Course and learn the basics.
To begin with, your interview preparations Enhance your Data Structures concepts with the Python DS Course.