# tf-idf Model for Page Ranking

• Difficulty Level : Medium
• Last Updated : 07 Aug, 2019

tf-idf stands for Term frequency-inverse document frequency. The tf-idf weight is a weight often used in information retrieval and text mining. Variations of the tf-idf weighting scheme are often used by search engines in scoring and ranking a document’s relevance given a query. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus (data-set).

How to Compute:
tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered to be more important.

Attention reader! Don’t stop learning now. Get hold of all the important Machine Learning Concepts with the Machine Learning Foundation Course at a student-friendly price and become industry ready.

Typically, the tf-idf weight is composed by two terms-

1. Normalized Term Frequency (tf)
2. Inverse Document Frequency (idf)

Let’s us take 3 documents to show how this works.

Doc 1: Ben studies about computers in Computer Lab.
Doc 2: Steve teaches at Brown University.
Doc 3: Data Scientists work on large datasets.

Let’s say we are doing a search on these documents with the following query: Data Scientists

The query is a free text query. It means a query in which the terms of the query are typed freeform into the search interface, without any connecting search operators.

### Step 1: Computing the Term Frequency(tf)

Frequency indicates the number of occurences of a particular term t in document d. Therefore,

```tf(t, d) = N(t, d), wherein tf(t, d) = term frequency for a term t in document d.

N(t, d)  = number of times a term t occurs in document d```

We can see that as a term appears more in the document it becomes more important, which is logical.We can use a vector to represent the document in bag of words model, since the ordering of terms is not important. There is an entry for each unique term in the document with the value being its term frequency.

Given below are the terms and their frequency on each of the document. [N(t, d)]

`tf for document 1:`

Doc 1BenStudiesComputerLab
tf1121

Vector Space Representation for Doc 1 : [1, 1, 2, 1]

`tf for document 2:`

Doc 2SteveteachesBrownUniversity
tf1111

Vector Space Representation for Doc 2 : [1, 1, 1, 1]

`tf for document 3:`

Doc 3DataScientistsworklargedatasets
tf11111

Vector Space Representation for Doc 3 : [1, 1, 1, 1, 1]

Thus, the representation of documents as vectors in a common vector space is known as Vector Space Model and it’s very fundamental to information retrieval.

Since we are dealing with the term frequency which rely on the occurrence counts, thus, longer documents will be favored more. To avoid this, normalize the term frequency

```tf(t, d) = N(t, d) / ||D||
wherein, ||D|| = Total number of term in the document```

`||D|| for each document:`

Documents||D||
17
25
36

Given below are the normalized term frequency for all the documents, i.e. [N(t, d) / ||D||]

`Normalized TF for Document 1:`

Doc1BenstudiesComputerLab
Normalized Tf0.1430.1430.2860.143

Vector Space Representation for Document 1 : [0.143, 0.143, 0.286, 0.143]

`Normalized tf for document 2:`

Doc 2SteveteachesBrownUniversity
NormalizedTf0.20.20.20.2

Vector Space Representation for Document 2 : [0.2, 0.2, 0.2, 0.2]

`Normalized tf for document 3:`

Doc 3DataScientistsworklargedatasets
NormalizedTf0.1670.1670.1670.1670.167

Vector Space Representation for Document 3 : [0.167, 0.167, 0.167, 0.167, 0.167]

Below function in Python will do the normalized TF calculation:

 `def` `termFrequency(term, doc): ``     ` `    ``"""``    ``Input: term: Term in the Document, doc: Document``    ``Return: Normalized tf: Number of times term occurs``      ``in document/Total number of terms in the document``    ``"""``    ``# Splitting the document into individual terms``    ``normalizeTermFreq ``=` `doc.lower().split() `` ` `    ``# Number of times the term occurs in the document``    ``term_in_document ``=` `normalizeTermFreq.count(term.lower()) `` ` `    ``# Total number of terms in the document``    ``len_of_document ``=` `float``(``len``(normalizeTermFreq )) `` ` `    ``# Normalized Term Frequency``    ``normalized_tf ``=` `term_in_document ``/` `len_of_document `` ` `    ``return` `normalized_tf`

### Step 2: Compute the Inverse Document Frequency – idf

It typically measures how important a term is. The main purpose of doing a search is to find out relevant documents matching the query. Since `tf `considers all terms equally important, thus, we can’t only use term frequencies to calculate the weight of a term in the document. However, it is known that certain terms, such as “is”, “of”, and “that”, may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones. Logarithms helps us to solve this problem.

First of all, find the document frequency of a term t by counting the number of documents containing the term:

```df(t) = N(t)

where-
df(t) = Document frequency of a term t
N(t) = Number of documents containing the term t```

Term frequency is the occurrence count of a term in one particular document only; while document frequency is the number of different documents the term appears in, so it depends on the whole corpus. Now let’s look at the definition of inverse document frequency. The idf of a term is the number of documents in the corpus divided by the document frequency of a term.

`idf(t) = N/ df(t) = N/N(t)`

It’s expected that the more frequent term to be considered less important, but the factor (most probably integers) seems too harsh. Therefore, we take the logarithm (with base 2 ) of the inverse document frequencies. So, the idf of a term t becomes :

`idf(t) = log(N/ df(t))`

This is better, and since log is a monotonically increasing function we can safely use it. Let’s compute IDF for the term Computer:

`idf(computer) = log(Total Number Of Documents / Number Of Documents with term Computer in it)`

There are 3 documents in all = Document1, Document2, Document3

```The term Computer appears in Document1

idf(computer) = log(3 / 1)
= 1.5849```

Given below is the idf for terms occurring in all the documents-

GivenNo. of documents in which term appears(Nt)idf = Log(N/Nt)
Ben1Log(3/1)=1.5849
Studies1Log(3/1)=1.5849
Computer1Log(3/1)=1.5849
Lab1Log(3/1)=1.5849
Steve1Log(3/1)=1.5849
Teaches1Log(3/1)=1.5849
Brown1Log(3/1)=1.5849
University1Log(3/1)=1.5849
Data1Log(3/1)=1.5849
Scientists1Log(3/1)=1.5849
Work1Log(3/1)=1.5849
Large1Log(3/1)=1.5849
Dataset1Log(3/1)=1.5849

Given below is the function in python to calculate idf:

 `def` `inverseDocumentFrequency(term, allDocs):``    ``num_docs_with_given_term ``=` `0`` ` `    ``"""``    ``Input: term: Term in the Document,``           ``allDocs: List of all documents``    ``Return: Inverse Document Frequency (idf) for term``            ``= Logarithm ((Total Number of Documents) / ``            ``(Number of documents containing the term))``    ``"""``    ``# Iterate through all the documents``    ``for` `doc ``in` `allDocs:``         ` `        ``"""``        ``Putting a check if a term appears in a document.``        ``If term is present in the document, then ``        ``increment "num_docs_with_given_term" variable``        ``"""``        ``if` `term.lower() ``in` `allDocs[doc].lower().split():``            ``num_docs_with_given_term ``+``=` `1`` ` `    ``if` `num_docs_with_given_term > ``0``:``        ``# Total number of documents``        ``total_num_docs ``=` `len``(allDocs) `` ` `        ``# Calculating the IDF ``        ``idf_val ``=` `log(``float``(total_num_docs) ``/` `num_docs_with_given_term)``        ``return` `idf_val``    ``else``:``        ``return` `0`

### Step 3: tf-idf Scoring

Now we have defined both tf and idf and now we can combine these to produce the ultimate score of a term t in document d. Therefore,

`tf-idf(t, d) = tf(t, d)* idf(t, d)`

For each term in the query multiply its normalized term frequency with its IDF on each document. In Document3 for the term data, the normalized term frequency is 0.167 and its IDF is 1.5849. Multiplying them together we get 0.2646. Given below is TF * IDF calculations for data and Scientists in all the documents.

Doc 1Doc 2Doc 3
Data

000.2646
Scientists

000.2646

We will use any of the similarity measures (eg, Cosine Similarity method) to find the similarity between the query and each document. For example, if we use Cosine Similarity Method to find the similarity, then smallest the angle, the more is the similarity.

Using the formula given below we can find out the similarity between any two documents, let’s say d1, d2.

```Cosine Similarity (d1, d2) =  Dot product(d1, d2) / ||d1|| * ||d2||

Dot product (d1, d2) = d1 * d2 + d1 * d2 * … * d1[n] * d2[n]
||d1|| = square root(d1^2 + d1^2 + ... + d1[n]^2)
||d2|| = square root(d2^2 + d2^2 + ... + d2[n]^2)```

References:

My Personal Notes arrow_drop_up