tf-idf stands for Term frequency-inverse document frequency. The tf-idf weight is a weight often used in information retrieval and text mining. Variations of the tf-idf weighting scheme are often used by search engines in scoring and ranking a document’s relevance given a query. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus (data-set).
How to Compute:
tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered to be more important.
Typically, the tf-idf weight is composed by two terms-
- Normalized Term Frequency (tf)
- Inverse Document Frequency (idf)
Let’s us take 3 documents to show how this works.
Doc 1: Ben studies about computers in Computer Lab.
Doc 2: Steve teaches at Brown University.
Doc 3: Data Scientists work on large datasets.
Let’s say we are doing a search on these documents with the following query: Data Scientists
The query is a free text query. It means a query in which the terms of the query are typed freeform into the search interface, without any connecting search operators.
Step 1: Computing the Term Frequency(tf)
Frequency indicates the number of occurences of a particular term t in document d. Therefore,
tf(t, d) = N(t, d), wherein tf(t, d) = term frequency for a term t in document d. N(t, d) = number of times a term t occurs in document d
We can see that as a term appears more in the document it becomes more important, which is logical.We can use a vector to represent the document in bag of words model, since the ordering of terms is not important. There is an entry for each unique term in the document with the value being its term frequency.
Given below are the terms and their frequency on each of the document. [N(t, d)]
tf for document 1:
Vector Space Representation for Doc 1 : [1, 1, 2, 1]
tf for document 2:
Vector Space Representation for Doc 2 : [1, 1, 1, 1]
tf for document 3:
Vector Space Representation for Doc 3 : [1, 1, 1, 1, 1]
Thus, the representation of documents as vectors in a common vector space is known as Vector Space Model and it’s very fundamental to information retrieval.
Since we are dealing with the term frequency which rely on the occurrence counts, thus, longer documents will be favored more. To avoid this, normalize the term frequency
tf(t, d) = N(t, d) / ||D|| wherein, ||D|| = Total number of term in the document
||D|| for each document:
Given below are the normalized term frequency for all the documents, i.e. [N(t, d) / ||D||]
Normalized TF for Document 1:
Vector Space Representation for Document 1 : [0.143, 0.143, 0.286, 0.143]
Normalized tf for document 2:
Vector Space Representation for Document 2 : [0.2, 0.2, 0.2, 0.2]
Normalized tf for document 3:
Vector Space Representation for Document 3 : [0.167, 0.167, 0.167, 0.167, 0.167]
Below function in Python will do the normalized TF calculation:
Step 2: Compute the Inverse Document Frequency – idf
It typically measures how important a term is. The main purpose of doing a search is to find out relevant documents matching the query. Since
tf considers all terms equally important, thus, we can’t only use term frequencies to calculate the weight of a term in the document. However, it is known that certain terms, such as “is”, “of”, and “that”, may appear a lot of times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones. Logarithms helps us to solve this problem.
First of all, find the document frequency of a term t by counting the number of documents containing the term:
df(t) = N(t) where- df(t) = Document frequency of a term t N(t) = Number of documents containing the term t
Term frequency is the occurrence count of a term in one particular document only; while document frequency is the number of different documents the term appears in, so it depends on the whole corpus. Now let’s look at the definition of inverse document frequency. The idf of a term is the number of documents in the corpus divided by the document frequency of a term.
idf(t) = N/ df(t) = N/N(t)
It’s expected that the more frequent term to be considered less important, but the factor (most probably integers) seems too harsh. Therefore, we take the logarithm (with base 2 ) of the inverse document frequencies. So, the idf of a term t becomes :
idf(t) = log(N/ df(t))
This is better, and since log is a monotonically increasing function we can safely use it. Let’s compute IDF for the term Computer:
idf(computer) = log(Total Number Of Documents / Number Of Documents with term Computer in it)
There are 3 documents in all = Document1, Document2, Document3
The term Computer appears in Document1 idf(computer) = log(3 / 1) = 1.5849
Given below is the idf for terms occurring in all the documents-
|Given||No. of documents in which term appears(Nt)||idf = Log(N/Nt)|
Given below is the function in python to calculate idf:
Step 3: tf-idf Scoring
Now we have defined both tf and idf and now we can combine these to produce the ultimate score of a term t in document d. Therefore,
tf-idf(t, d) = tf(t, d)* idf(t, d)
For each term in the query multiply its normalized term frequency with its IDF on each document. In Document3 for the term data, the normalized term frequency is 0.167 and its IDF is 1.5849. Multiplying them together we get 0.2646. Given below is TF * IDF calculations for data and Scientists in all the documents.
|Doc 1||Doc 2||Doc 3|
We will use any of the similarity measures (eg, Cosine Similarity method) to find the similarity between the query and each document. For example, if we use Cosine Similarity Method to find the similarity, then smallest the angle, the more is the similarity.
Using the formula given below we can find out the similarity between any two documents, let’s say d1, d2.
Cosine Similarity (d1, d2) = Dot product(d1, d2) / ||d1|| * ||d2|| Dot product (d1, d2) = d1 * d2 + d1 * d2 * … * d1[n] * d2[n] ||d1|| = square root(d1^2 + d1^2 + ... + d1[n]^2) ||d2|| = square root(d2^2 + d2^2 + ... + d2[n]^2)