tf-idf Model for Page Ranking

tf-idf stands for Term frequency-inverse document frequency. The tf-idf weight is a weight often used in information retrieval and text mining. Variations of the tf-idf weighting scheme are often used by search engines in scoring and ranking a document’s relevance given a query. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus (data-set).

How to Compute:
tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered to be more important.

Typically, the tf-idf weight is composed by two terms-



  1. Normalized Term Frequency (tf)
  2. Inverse Document Frequency (idf)

Let’s us take 3 documents to show how this works.

Doc 1: Ben studies about computers in Computer Lab.
Doc 2: Steve teaches at Brown University.
Doc 3: Data Scientists work on large datasets.

Let’s say we are doing a search on these documents with the following query: Data Scientists

The query is a free text query. It means a query in which the terms of the query are typed freeform into the search interface, without any connecting search operators.

Step 1: Computing the Term Frequency(tf)

Frequency indicates the number of occurences of a particular term t in document d. Therefore,

tf(t, d) = N(t, d), wherein tf(t, d) = term frequency for a term t in document d.

N(t, d)  = number of times a term t occurs in document d

We can see that as a term appears more in the document it becomes more important, which is logical.We can use a vector to represent the document in bag of words model, since the ordering of terms is not important. There is an entry for each unique term in the document with the value being its term frequency.

Given below are the terms and their frequency on each of the document. [N(t, d)]

tf for document 1:

Doc 1 Ben Studies Computer Lab
tf 1 1 2 1

Vector Space Representation for Doc 1 : [1, 1, 2, 1]

tf for document 2:


Doc 2 Steve teaches Brown University
tf 1 1 1 1

Vector Space Representation for Doc 2 : [1, 1, 1, 1]

tf for document 3:

Doc 3 Data Scientists work large datasets
tf 1 1 1 1 1

Vector Space Representation for Doc 3 : [1, 1, 1, 1, 1]

Thus, the representation of documents as vectors in a common vector space is known as Vector Space Model and it’s very fundamental to information retrieval.

Since we are dealing with the term frequency which rely on the occurrence counts, thus, longer documents will be favored more. To avoid this, normalize the term frequency

tf(t, d) = N(t, d) / ||D||
wherein, ||D|| = Total number of term in the document

||D|| for each document:

Documents ||D||
1 7
2 5
3 6

Given below are the normalized term frequency for all the documents, i.e. [N(t, d) / ||D||]

Normalized TF for Document 1:

Doc1 Ben studies Computer Lab
Normalized Tf 0.143 0.143 0.286 0.143

Vector Space Representation for Document 1 : [0.143, 0.143, 0.286, 0.143]

Normalized tf for document 2:


Doc 2 Steve teaches Brown University
NormalizedTf 0.2 0.2 0.2 0.2

Vector Space Representation for Document 2 : [0.2, 0.2, 0.2, 0.2]

Normalized tf for document 3:

Doc 3 Data Scientists work large datasets
NormalizedTf 0.167 0.167 0.167 0.167 0.167

Vector Space Representation for Document 3 : [0.167, 0.167, 0.167, 0.167, 0.167]

Below function in Python will do the normalized TF calculation:

filter_none

edit
close

play_arrow

link
brightness_4
code

def termFrequency(term, doc): 
      
    """
    Input: term: Term in the Document, doc: Document
    Return: Normalized tf: Number of times term occurs
      in document/Total number of terms in the document
    """
    # Splitting the document into individual terms
    normalizeTermFreq = doc.lower().split() 
  
    # Number of times the term occurs in the document
    term_in_document = normalizeTermFreq.count(term.lower()) 
  
    # Total number of terms in the document
    len_of_document = float(len(normalizeTermFreq )) 
  
    # Normalized Term Frequency
    normalized_tf = term_in_document / len_of_document 
  
    return normalized_tf

chevron_right


Step 2: Compute the Inverse Document Frequency – idf

It typically measures how important a term is. The main purpose of doing a search is to find out relevant documents matching the query. Since tf considers all terms equally important, thus, we can’t only use term frequencies to calculate the weight of a term in the document. However, it is known that certain terms, such as “is”, “of”, and “that”, may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones. Logarithms helps us to solve this problem.

First of all, find the document frequency of a term t by counting the number of documents containing the term:

df(t) = N(t)

where-
df(t) = Document frequency of a term t
N(t) = Number of documents containing the term t

Term frequency is the occurrence count of a term in one particular document only; while document frequency is the number of different documents the term appears in, so it depends on the whole corpus. Now let’s look at the definition of inverse document frequency. The idf of a term is the number of documents in the corpus divided by the document frequency of a term.

idf(t) = N/ df(t) = N/N(t)

It’s expected that the more frequent term to be considered less important, but the factor (most probably integers) seems too harsh. Therefore, we take the logarithm (with base 2 ) of the inverse document frequencies. So, the idf of a term t becomes :

idf(t) = log(N/ df(t))

This is better, and since log is a monotonically increasing function we can safely use it. Let’s compute IDF for the term Computer:

idf(computer) = log(Total Number Of Documents / Number Of Documents with term Computer in it)

There are 3 documents in all = Document1, Document2, Document3


The term Computer appears in Document1

idf(computer) = log(3 / 1)
          = 1.5849

Given below is the idf for terms occurring in all the documents-

Given No. of documents in which term appears(Nt) idf = Log(N/Nt)
Ben 1 Log(3/1)=1.5849
Studies 1 Log(3/1)=1.5849
Computer 1 Log(3/1)=1.5849
Lab 1 Log(3/1)=1.5849
Steve 1 Log(3/1)=1.5849
Teaches 1 Log(3/1)=1.5849
Brown 1 Log(3/1)=1.5849
University 1 Log(3/1)=1.5849
Data 1 Log(3/1)=1.5849
Scientists 1 Log(3/1)=1.5849
Work 1 Log(3/1)=1.5849
Large 1 Log(3/1)=1.5849
Dataset 1 Log(3/1)=1.5849

Given below is the function in python to calculate idf:

filter_none

edit
close

play_arrow

link
brightness_4
code

def inverseDocumentFrequency(term, allDocs):
    num_docs_with_given_term = 0
  
    """
    Input: term: Term in the Document,
           allDocs: List of all documents
    Return: Inverse Document Frequency (idf) for term
            = Logarithm ((Total Number of Documents) / 
            (Number of documents containing the term))
    """
    # Iterate through all the documents
    for doc in allDocs:
          
        """
        Putting a check if a term appears in a document.
        If term is present in the document, then 
        increment "num_docs_with_given_term" variable
        """
        if term.lower() in allDocs[doc].lower().split():
            num_docs_with_given_term += 1
  
    if num_docs_with_given_term > 0:
        # Total number of documents
        total_num_docs = len(allDocs) 
  
        # Calculating the IDF 
        idf_val = log(float(total_num_docs) / num_docs_with_given_term)
        return idf_val
    else:
        return 0

chevron_right


Step 3: tf-idf Scoring

Now we have defined both tf and idf and now we can combine these to produce the ultimate score of a term t in document d. Therefore,

tf-idf(t, d) = tf(t, d)* idf(t, d)

For each term in the query multiply its normalized term frequency with its IDF on each document. In Document3 for the term data, the normalized term frequency is 0.167 and its IDF is 1.5849. Multiplying them together we get 0.2646. Given below is TF * IDF calculations for data and Scientists in all the documents.

Doc 1 Doc 2 Doc 3
Data 0 0 0.2646
Scientists 0 0 0.2646

We will use any of the similarity measures (eg, Cosine Similarity method) to find the similarity between the query and each document. For example, if we use Cosine Similarity Method to find the similarity, then smallest the angle, the more is the similarity.

Using the formula given below we can find out the similarity between any two documents, let’s say d1, d2.

Cosine Similarity (d1, d2) =  Dot product(d1, d2) / ||d1|| * ||d2||

Dot product (d1, d2) = d1[0] * d2[0] + d1[1] * d2[1] * … * d1[n] * d2[n]
||d1|| = square root(d1[0]^2 + d1[1]^2 + ... + d1[n]^2)
||d2|| = square root(d2[0]^2 + d2[1]^2 + ... + d2[n]^2)

 

References:



My Personal Notes arrow_drop_up

Check out this Author's contributed articles.

If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to contribute@geeksforgeeks.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.

Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.



Improved By : kapil01agnihotri