Create Inverted Index for File using Python

  • Difficulty Level : Medium
  • Last Updated : 29 Dec, 2020

An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple terms, it is a hashmap-like data structure that directs you from a word to the documents or web pages that contain it.
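As a minimal illustration of that mapping, a plain Python dictionary can play the role of the index. The document names and contents below are made up for this sketch and are not part of the article's sample file.

```python
# Hypothetical toy corpus: document id -> text (illustration only).
documents = {
    "doc1": "python is fun",
    "doc2": "python is fast",
}

# Map each word to the sorted list of documents that contain it.
toy_index = {}
for doc_id, text in documents.items():
    for word in set(text.split()):
        toy_index.setdefault(word, []).append(doc_id)
for doc_ids in toy_index.values():
    doc_ids.sort()

print(toy_index["python"])  # ['doc1', 'doc2']
print(toy_index["fast"])    # ['doc2']
```

Looking up a word is then a single dictionary access, which is what makes the structure useful for search.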

Creating Inverted Index

We will create a word-level inverted index: for a given word, it returns the list of lines in which that word appears. The index is a dictionary whose keys are the words present in the file and whose values are lists of the line numbers on which each word occurs. To create a file in a Jupyter notebook, use the %%writefile magic function:

%%writefile file.txt
This is the first word.
This is the second text, Hello! How are you?
This is the third, this is it now.

This creates a file named file.txt with the content shown above.
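Outside a notebook the %%writefile magic is not available. One possible equivalent in plain Python is sketched below. It assumes the file should end without a trailing newline, which matches the three-element list shown later in the article.

```python
# Write the same three lines to file.txt without Jupyter magic.
# Assumption: no trailing newline after the last line.
content = (
    "This is the first word.\n"
    "This is the second text, Hello! How are you?\n"
    "This is the third, this is it now."
)
with open("file.txt", "w", encoding="utf8") as f:
    f.write(content)
```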

To read file: 


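# open the file and read its whole content into one string
file = open('file.txt', encoding='utf8')
read = file.read()
# read() leaves the cursor at the end of the file,
# so move it back to the beginning
file.seek(0)

# count the lines: one more than the number of newline characters
line = 1
for char in read:
    if char == '\n':
        line += 1
print("Number of lines in file is:", line)

# store each line as an element of a list
array = []
for i in range(line):
    array.append(file.readline())
file.close()
print(array)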


Number of lines in file is: 3
['This is the first word.\n',
'This is the second text, Hello! How are you?\n',
'This is the third, this is it now.']

Functions used:

  • open(): It is used to open the file.
  • read(): This function reads the whole content of the file and returns it as a single string.
  • seek(0): It moves the cursor back to the beginning of the file.
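The interplay between read() and seek(0) can be seen in a small sketch. It uses io.StringIO, an in-memory object from the standard library that behaves like an open text file, so no file on disk is needed. The two-line text is made up for the example.

```python
import io

# io.StringIO behaves like an open text file but lives in memory.
f = io.StringIO("first line\nsecond line")

whole = f.read()       # reads everything; the cursor is now at the end
after_end = f.read()   # nothing is left to read, so this is ''
f.seek(0)              # move the cursor back to the start
first = f.readline()   # reads up to and including the first newline

print(repr(after_end))  # ''
print(repr(first))      # 'first line\n'
```

Without the seek(0) call, readline() would also return an empty string, because the cursor would still be at the end of the data.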

Remove punctuation: 


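# raw string, so the backslash is taken literally
punc = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''
# replace every punctuation character with a space
for ele in punc:
    read = read.replace(ele, " ")
# convert to lowercase to maintain uniformity
read = read.lower()
print(repr(read))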


'this is the first word \n
this is the second text hello how are you \n
this is the third this is it now '
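A possible alternative to the replace loop is str.translate with a table built by str.maketrans. This sketch is not the article's code, and it uses string.punctuation from the standard library, which covers a slightly different set of characters than the punc string above.

```python
import string

text = "This is the second text, Hello! How are you?"

# Map every ASCII punctuation character to a space, then lowercase.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
cleaned = text.translate(table).lower()

print(cleaned.split())
# ['this', 'is', 'the', 'second', 'text', 'hello', 'how', 'are', 'you']
```

The translation runs in a single pass over the text, whereas the loop makes one pass per punctuation character.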

Clean data by removing stopwords: 

Stop words are very common words, such as "is", "the" and "this", that carry little meaning on their own and can safely be ignored without sacrificing the meaning of the sentence.


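import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# download the stop-word lists on first use
nltk.download('stopwords')
# word_tokenize also needs tokenizer data; depending on the NLTK
# version this is nltk.download('punkt') or nltk.download('punkt_tab')

# convert the text into tokens
text_tokens = word_tokenize(read)
# build the stop-word set once instead of once per token
stop_words = set(stopwords.words())
tokens_without_sw = [
    word for word in text_tokens if word not in stop_words]
print(tokens_without_sw)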


['first', 'word', 'second', 'text', 'hello', 'third']
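To show the filtering idea without installing NLTK, here is a sketch that uses a tiny hand-written stop-word set. The set is an assumption made for this example only; NLTK's lists are far larger.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"this", "is", "the", "how", "are", "you", "it", "now"}

text = ("this is the first word \n"
        "this is the second text hello how are you \n"
        "this is the third this is it now ")

# str.split() with no argument splits on any run of whitespace.
tokens = [w for w in text.split() if w not in STOP_WORDS]
print(tokens)  # ['first', 'word', 'second', 'text', 'hello', 'third']
```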

Create an inverted index:


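# named inverted_index to avoid shadowing the built-in dict
inverted_index = {}
for i in range(line):
    check = array[i].lower()
    for item in tokens_without_sw:
        # note: this is a substring test on the raw line
        if item in check:
            if item not in inverted_index:
                inverted_index[item] = []
            # line numbers are 1-based; skip duplicates
            if i + 1 not in inverted_index[item]:
                inverted_index[item].append(i + 1)
print(inverted_index)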


{'first': [1],
'word': [1],
'second': [2], 
'text': [2], 
'hello': [2], 
'third': [3]}
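The loop above tests `item in check`, which is a substring test, so a token such as "text" would also match a line containing "context". A token-level variant could look like the sketch below. The function name build_inverted_index and the small stop-word set are assumptions for this example, not part of the article's code.

```python
import re
from collections import defaultdict


def build_inverted_index(lines, stop_words=frozenset()):
    """Map each non-stop word to the sorted 1-based line numbers containing it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        # \w+ extracts whole words, so "text" cannot match inside "context".
        for word in re.findall(r"\w+", line.lower()):
            if word not in stop_words:
                index[word].add(line_no)
    return {word: sorted(nums) for word, nums in index.items()}


lines = [
    "This is the first word.\n",
    "This is the second text, Hello! How are you?\n",
    "This is the third, this is it now.",
]
# Hypothetical stop-word set, kept tiny for the example.
stop_words = {"this", "is", "the", "how", "are", "you", "it", "now"}

index = build_inverted_index(lines, stop_words)
print(index)
# {'first': [1], 'word': [1], 'second': [2], 'text': [2], 'hello': [2], 'third': [3]}
print(index.get("hello", []))  # [2]
```

Using index.get(word, []) returns an empty list for words that are not indexed, so a lookup never raises KeyError.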
