
Inverted Index

An Inverted Index is a data structure used in information retrieval systems to efficiently retrieve documents or web pages containing a specific term or set of terms. In an inverted index, the index is organized by terms (words), and each term points to a list of documents or web pages that contain that term.

Inverted indexes are widely used in search engines, database systems, and other applications where efficient text search is required. They are especially useful for large collections of documents, where searching through all the documents would be prohibitively slow.



An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap-like data structure that directs you from a word to a document or a web page.

Example: Consider the following documents.



Document 1: The quick brown fox jumped over the lazy dog.
Document 2: The lazy dog slept in the sun.

To create an inverted index for these documents, we first tokenize the documents into terms, as follows.

Document 1: The, quick, brown, fox, jumped, over, the, lazy, dog.
Document 2: The, lazy, dog, slept, in, the, sun.

Next, we create an index of the terms, where each term points to a list of documents that contain that term, as follows.

The -> Document 1, Document 2
Quick -> Document 1
Brown -> Document 1
Fox -> Document 1
Jumped -> Document 1
Over -> Document 1
Lazy -> Document 1, Document 2
Dog -> Document 1, Document 2
Slept -> Document 2
In -> Document 2
Sun -> Document 2

To search for documents containing a particular term or set of terms, the search engine queries the inverted index for those terms and retrieves the list of documents associated with each term. The search engine can then use this information to rank the documents based on relevance to the query and present them to the user in order of importance.
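
As a minimal sketch of this lookup, assume the index for the two example documents is held in a Python dictionary that maps each lowercased term to a set of document IDs. The `search` helper and its "match all terms" behaviour are illustrative assumptions, and ranking is left out:

```python
# Hypothetical index for the two example documents (term -> set of document IDs).
inverted_index = {
    "the": {1, 2}, "quick": {1}, "brown": {1}, "fox": {1},
    "jumped": {1}, "over": {1}, "lazy": {1, 2}, "dog": {1, 2},
    "slept": {2}, "in": {2}, "sun": {2},
}

def search(index, query):
    """Return the sorted IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Look up each term's set of documents; an unknown term matches nothing.
    postings = [index.get(term, set()) for term in terms]
    # Intersect the sets so only documents containing all terms remain.
    return sorted(set.intersection(*postings))

print(search(inverted_index, "lazy dog"))   # [1, 2]
print(search(inverted_index, "quick dog"))  # [1]
print(search(inverted_index, "sun fox"))    # []
```

Each query term costs one dictionary lookup, so the work depends on the number of query terms and the size of their document sets rather than on the size of the whole collection.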

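There are two types of inverted indexes: a record-level inverted index, which stores for each word a list of the documents that contain it (as in the example above), and a word-level inverted index, which additionally stores the position of each word within a document.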

Suppose we want to search the texts “hello everyone”, “this article is based on an inverted index”, and “which is hashmap-like data structure”. If we index by (text, word within the text), the index with a location in the text is:

hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
an (2, 6)
inverted (2, 7)
index (2, 8)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)

The word “hello” is in document 1 (“hello everyone”) starting at word 1, so has an entry (1, 1), and the word “is” is in documents 2 and 3 at ‘3rd’ and ‘2nd’ positions respectively (here position is based on the word). 
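
A word-level index of this kind can be sketched as follows. The function name `build_positional_index` and the tokenization rule (lowercase, split on anything that is not a letter or digit, so that "hashmap-like" becomes two words) are assumptions made for this illustration:

```python
import re
from collections import defaultdict

texts = [
    "hello everyone",
    "this article is based on an inverted index",
    "which is hashmap-like data structure",
]

def build_positional_index(texts):
    """Map each word to a list of (text number, word position) pairs, both 1-based."""
    index = defaultdict(list)
    for text_no, text in enumerate(texts, start=1):
        # Assumed tokenization: lowercase runs of letters and digits.
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words, start=1):
            index[word].append((text_no, position))
    return dict(index)

positional_index = build_positional_index(texts)
print(positional_index["hello"])  # [(1, 1)]
print(positional_index["is"])     # [(2, 3), (3, 2)]
```

Keeping positions makes phrase queries possible, at the cost of a larger index.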

The index may have weights, frequencies, or other indicators.
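
For instance, a term-frequency variant could store how often each term occurs in each document. This is only a sketch: the nested-dictionary layout and the name `build_frequency_index` are assumptions, and the sample texts are the two example documents with case and punctuation already removed:

```python
from collections import Counter, defaultdict

documents = {
    1: "the quick brown fox jumped over the lazy dog",
    2: "the lazy dog slept in the sun",
}

def build_frequency_index(documents):
    """Map each term to {document ID: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Counter tallies how many times each term appears in this document.
        for term, count in Counter(text.split()).items():
            index[term][doc_id] = count
    return dict(index)

frequency_index = build_frequency_index(documents)
print(frequency_index["the"])  # {1: 2, 2: 2}
print(frequency_index["fox"])  # {1: 1}
```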

Steps to Build an Inverted Index

Example:

Words    Document
ant      doc1
demo     doc2
world    doc1, doc2

Implementing Inverted Index




import string

# Define the documents
document1 = "The quick brown fox jumped over the lazy dog."
document2 = "The lazy dog slept in the sun."

# Step 1: Tokenize the documents
# Lowercase each document, remove punctuation, and split it into words
def tokenize(text):
    no_punctuation = text.lower().translate(str.maketrans("", "", string.punctuation))
    return no_punctuation.split()

tokens1 = tokenize(document1)
tokens2 = tokenize(document2)

# Combine the tokens into a sorted list of unique terms
terms = sorted(set(tokens1 + tokens2))

# Step 2: Build the inverted index
# Create an empty dictionary to store the inverted index
inverted_index = {}

# For each term, find the documents that contain it
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents

# Step 3: Print the inverted index
for term, documents in inverted_index.items():
    print(term, "->", ", ".join(documents))

Explanation of the Above Code

The code first defines two sample documents to be used as input to the algorithm.

Step 1: Tokenize the input documents by converting them to lowercase, removing punctuation, and splitting them into individual words. Removing punctuation matters: without it, "dog." in Document 1 and "dog" in Document 2 would be indexed as two different terms. Then combine the resulting tokens from both documents into a single sorted list of unique terms.

Step 2: Create an empty dictionary to store the inverted index, and then iterate through each term in the list of unique terms. For each term, create an empty list of documents, and then check if the term appears in each input document.

If the term appears in a document, add the document to the list for that term. Finally, add an entry to the inverted index dictionary for the current term, with the list of documents that contain that term as its value.

Step 3: Iterate through the entries in the inverted index dictionary and print out each term along with the list of documents that contain it.

Output
brown -> Document 1
dog -> Document 1, Document 2
fox -> Document 1
in -> Document 2
jumped -> Document 1
lazy -> Document 1, Document 2
over -> Document 1
quick -> Document 1
slept -> Document 2
sun -> Document 2
the -> Document 1, Document 2
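
The script above checks every term against two hard-coded token lists. One possible generalization, sketched here under the assumption that the documents arrive as a dictionary of name to text (the name `build_inverted_index` is illustrative), scans each document once and works for any number of documents. It strips punctuation so that "dog." and "dog" count as the same term:

```python
import string
from collections import defaultdict

def build_inverted_index(documents):
    """Build a mapping from each term to a sorted list of document names."""
    strip_punctuation = str.maketrans("", "", string.punctuation)
    index = defaultdict(set)
    for name, text in documents.items():
        # Lowercase, drop punctuation, then split on whitespace.
        for term in text.lower().translate(strip_punctuation).split():
            index[term].add(name)
    # Sort terms and document names so the result is deterministic.
    return {term: sorted(names) for term, names in sorted(index.items())}

documents = {
    "Document 1": "The quick brown fox jumped over the lazy dog.",
    "Document 2": "The lazy dog slept in the sun.",
}
index = build_inverted_index(documents)
print(index["dog"])  # ['Document 1', 'Document 2']
print(index["sun"])  # ['Document 2']
```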



Advantages of Inverted Index

Disadvantages of Inverted Index

Features of Inverted Indexes

FAQs on Inverted Index

1. Why is it called an Inverted Index?

Answer:

It is called an inverted index because it is an inversion of the forward index: it maps terms to documents rather than documents to terms.

2. What is the Difference Between the Inverted Index and the Forward Index?

Answer:

The main difference is a trade-off between indexing and searching. The forward index stores, for each document, the terms it contains, which makes indexing faster. The inverted index stores, for each term, the documents that contain it, which makes searching faster.
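
To make the relationship concrete, the sketch below inverts a small forward index. The document IDs and terms are hypothetical, and `invert` is an illustrative helper name:

```python
from collections import defaultdict

# Hypothetical forward index: document ID -> terms it contains.
forward_index = {
    "doc1": ["ant", "world"],
    "doc2": ["demo", "world"],
}

def invert(forward_index):
    """Turn a document -> terms mapping into a term -> documents mapping."""
    inverted = defaultdict(list)
    for doc_id, terms in forward_index.items():
        for term in terms:
            # Record each document at most once per term.
            if doc_id not in inverted[term]:
                inverted[term].append(doc_id)
    return dict(inverted)

inverted_index = invert(forward_index)
print(inverted_index)
# {'ant': ['doc1'], 'world': ['doc1', 'doc2'], 'demo': ['doc2']}
```

Finding every document that contains "world" is a single lookup in the inverted index, whereas the forward index would have to be scanned document by document.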

3. Where is the Inverted Index used?

Answer:

An inverted index is a data structure that is generally used in search engines and databases for locating relevant information quickly.

