
Problem solving on Boolean Model and Vector Space Model

Last Updated : 30 May, 2021

Boolean Model: 

It is a simple retrieval model based on set theory and Boolean algebra. Queries are written as Boolean expressions, which have precise semantics. Retrieval is based on a binary decision criterion: a document either matches the query or it does not, with no partial match and no ranking. The model records only whether an index term is present in or absent from a document.

Problem Solving: 

Consider 5 documents with a vocabulary of 6 terms:

  • document 1 = ‘term1 term3’
  • document 2 = ‘term2 term4 term6’
  • document 3 = ‘term1 term2 term3 term4 term5’
  • document 4 = ‘term1 term3 term6’
  • document 5 = ‘term3 term4’

Our documents in the Boolean model (1 = term present, 0 = term absent):

             term1  term2  term3  term4  term5  term6
document 1     1      0      1      0      0      0
document 2     0      1      0      1      0      1
document 3     1      1      1      1      1      0
document 4     1      0      1      0      0      1
document 5     0      0      1      1      0      0

Consider the query

Find the documents that contain term1 and term3 but not term2:

term1 ∧ term3 ∧ ¬ term2

Replacing the term2 column with its negation ¬term2 gives:

             term1  ¬term2  term3  term4  term5  term6
document 1     1      1       1      0      0      0
document 2     0      0       0      1      0      1
document 3     1      0       1      1      1      0
document 4     1      1       1      0      0      1
document 5     0      1       1      1      0      0

  • document 1 : 1 ∧ 1 ∧ 1 = 1
  • document 2 : 0 ∧ 0 ∧ 0 = 0
  • document 3 : 1 ∧ 1 ∧ 0 = 0
  • document 4 : 1 ∧ 1 ∧ 1 = 1
  • document 5 : 0 ∧ 1 ∧ 1 = 0

Based on the above computation, document 1 and document 4 are relevant to the given query.
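
The evaluation above can be sketched in a few lines of Python. This is only an illustration under the assumption that each document is stored as a set of its terms; the names `docs` and `matches` are made up for the example and are not part of any library.

```python
# Each document is represented by the set of terms it contains.
docs = {
    "document1": {"term1", "term3"},
    "document2": {"term2", "term4", "term6"},
    "document3": {"term1", "term2", "term3", "term4", "term5"},
    "document4": {"term1", "term3", "term6"},
    "document5": {"term3", "term4"},
}


def matches(terms):
    """Evaluate the query term1 AND term3 AND NOT term2 for one document."""
    return "term1" in terms and "term3" in terms and "term2" not in terms


# A document is retrieved only if the Boolean expression is true for it.
relevant = [name for name, terms in docs.items() if matches(terms)]
print(relevant)  # ['document1', 'document4']
```

Because the decision is binary, the result is an unordered set of matches; nothing in this sketch says which of the two documents is the better one.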

Vector Model:

The procedure and the formulas required for the computation are described in the previous article (part 1). Consider the following collection of documents:

  • document1 = ‘one two’
  • document2 = ‘three two four’
  • document3 = ‘one two three’
  • document4 = ‘one two’

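The formulas used, where freq_{i,j} is the number of occurrences of term i in document j, N is the total number of documents, n_i is the number of documents containing term i, and t is the number of terms in the vocabulary:

tf_{i,j} = \frac{freq_{i,j}}{\max_l(freq_{l,j})}

idf_i = \log\frac{N}{n_i}

w_{i,j} = tf_{i,j} \times \log\frac{N}{n_i}

sim(d_j, q) = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}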

The total number of documents is N = 4. The term ‘one’ occurs in 3 documents, ‘two’ in 4, ‘three’ in 2 and ‘four’ in 1. Therefore, taking the logarithm to base 2, the IDF values of the terms are:

one --> log2(4/3) = 0.4147
two --> log2(4/4) = 0
three --> log2(4/2) = 1
four --> log2(4/1) = 2
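
As a rough check, the IDF values can be recomputed with a short Python sketch. The helper name `idf` and the list-of-word-lists layout are assumptions made for this illustration; the base-2 logarithm follows the worked example. The printed values should agree with the ones listed above when rounded to three decimal places.

```python
import math

# The four documents of the collection, each as a list of words.
documents = [
    ["one", "two"],
    ["three", "two", "four"],
    ["one", "two", "three"],
    ["one", "two"],
]


def idf(term, documents):
    """Return log2(N / n_i), where n_i counts the documents containing the term."""
    n_i = sum(1 for words in documents if term in words)
    if n_i == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log2(len(documents) / n_i)


idf_values = {term: idf(term, documents) for term in ["one", "two", "three", "four"]}
for term, value in idf_values.items():
    print(f"{term}: {value:.3f}")
```

A term that occurs in every document, such as ‘two’, gets an IDF of 0 and therefore cannot help to tell documents apart.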

Representation in the Boolean model

             one  two  three  four
document1     1    1     0     0
document2     0    1     1     1
document3     1    1     1     0
document4     1    1     0     0

Calculation of term frequency (in this example, the fraction of the N = 4 documents that contain each term)

one --> 3/4 = 0.75
two --> 4/4 = 1
three --> 2/4 = 0.5
four --> 1/4 = 0.25

Calculation of weights ( tf * idf )

weight(one) --> 0.75 * 0.4147 = 0.3110
weight(two) --> 1 * 0 = 0
weight(three) --> 0.5 * 1 = 0.5
weight(four) --> 0.25 * 2 = 0.5

Representation of the vector model in terms of weights (a term that is absent from a document has weight 0)

             one     two  three  four
document1   0.3110    0     0     0
document2   0         0     0.5   0.5
document3   0.3110    0     0.5   0
document4   0.3110    0     0     0
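
The weight table can be rebuilt with the following Python sketch. It follows the calculation used in this example, where the term frequency is taken as n_i / N; the variable names (`doc_count`, `weight`, `vectors`) are assumptions introduced for the illustration. Since no intermediate rounding is applied, the weight for ‘one’ may differ from the table in the fourth decimal place.

```python
import math

docs = {
    "document1": ["one", "two"],
    "document2": ["three", "two", "four"],
    "document3": ["one", "two", "three"],
    "document4": ["one", "two"],
}
vocab = ["one", "two", "three", "four"]
N = len(docs)

# n_i: number of documents that contain term i.
doc_count = {t: sum(t in words for words in docs.values()) for t in vocab}

# Weight as computed in the worked example: (n_i / N) * log2(N / n_i).
weight = {t: (doc_count[t] / N) * math.log2(N / doc_count[t]) for t in vocab}

# A document vector holds the term weight where the term occurs, 0 elsewhere.
vectors = {
    name: [weight[t] if t in words else 0.0 for t in vocab]
    for name, words in docs.items()
}

for name, vec in vectors.items():
    print(name, [round(w, 4) for w in vec])
```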

QUERY: ‘one three three’

Calculation of weights for the query terms (term frequency within the query, which has 3 words)

  • weight(one) --> 1/3 = 0.333
  • weight(three) --> 2/3 = 0.667

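Vector representation (components ordered as one, two, three, four)

  • Term weights \vec{d}_j = \{0.3110, 0, 0.5, 0.5\}, with the component set to 0 for every term that document j does not contain
  • Query \vec{q} = \{0.333, 0, 0.667, 0\}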

Similarity calculation: the cosine of the angle between each document vector and the query vector.

sim(d_1, q) = \frac{0.3110 \times 0.333 + 0 \times 0 + 0 \times 0.667 + 0 \times 0}{\sqrt{0.3110^2 + 0^2 + 0^2 + 0^2} \times \sqrt{0.333^2 + 0^2 + 0.667^2 + 0^2}} = 0.4467

sim(d_2, q) = \frac{0 \times 0.333 + 0 \times 0 + 0.5 \times 0.667 + 0.5 \times 0}{\sqrt{0^2 + 0^2 + 0.5^2 + 0.5^2} \times \sqrt{0.333^2 + 0^2 + 0.667^2 + 0^2}} = 0.6326

sim(d_3, q) = \frac{0.3110 \times 0.333 + 0 \times 0 + 0.5 \times 0.667 + 0 \times 0}{\sqrt{0.3110^2 + 0^2 + 0.5^2 + 0^2} \times \sqrt{0.333^2 + 0^2 + 0.667^2 + 0^2}} = 0.9956

sim(d_4, q) = \frac{0.3110 \times 0.333 + 0 \times 0 + 0 \times 0.667 + 0 \times 0}{\sqrt{0.3110^2 + 0^2 + 0^2 + 0^2} \times \sqrt{0.333^2 + 0^2 + 0.667^2 + 0^2}} = 0.4467

Ranking of the documents (documents with the same similarity are given the same rank):

document1   3rd
document2   2nd
document3   1st
document4   3rd

Since the similarity of document 3 to the query is greater than that of any other document, document 3 is the most relevant to the query.
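
The cosine similarity and the resulting order can be reproduced with a small Python sketch. The weight vectors and the query vector are copied from the example above; the function name `cosine` and the dictionary layout are assumptions made for this illustration, not a reference implementation.

```python
import math

# Weight vectors over the vocabulary (one, two, three, four).
doc_vectors = {
    "document1": [0.3110, 0.0, 0.0, 0.0],
    "document2": [0.0, 0.0, 0.5, 0.5],
    "document3": [0.3110, 0.0, 0.5, 0.0],
    "document4": [0.3110, 0.0, 0.0, 0.0],
}
query = [0.333, 0.0, 0.667, 0.0]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)


scores = {name: cosine(vec, query) for name, vec in doc_vectors.items()}

# Sort from most to least similar; equal scores keep their original order.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.4f}")
```

Unlike the Boolean model, every document receives a graded score here, so the documents can be ordered rather than simply accepted or rejected.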


