In this article, we will find the most similar sentence in the file to the input sentence.
Example:
File content: "This is movie." "This is romantic movie" "This is a girl." Input: "This is a boy" Similar sentence to input: "This is a girl", "This is movie".
Approach:
- Create a list to store all the unique words of the file.
- Convert all the sentences of the file into the binary format by comparing each word with the content of the list, after cleaning(removing stopword, stemming, etc.)
- Convert the input sentence in the binary format.
- Find the number of similar words in the input sentence to each sentence and store the value in the list named similarity index.
- Find the maximum value of similarity index and return the sentence having maximum similar words.
Content of the file:
Code to get a similar sentence:
Python3
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk
from nltk.corpus import stopwords
nltk.download( 'stopwords' )
ps = PorterStemmer()
f = open ( 'romyyy.txt' )
a = sent_tokenize(f.read())
# removal of stopwords stop_words = list (stopwords.words( 'english' ))
# removal of punctuation signs punc = '''!()-[]{};:'"\, <>./?@#$%^&*_~'''
s = [(word_tokenize(a[i])) for i in range ( len (a))]
outer_1 = []
for i in range ( len (s)):
inner_1 = []
for j in range ( len (s[i])):
if s[i][j] not in (punc or stop_words):
s[i][j] = ps.stem(s[i][j])
if s[i][j] not in stop_words:
inner_1.append(s[i][j].lower())
outer_1.append( set (inner_1))
rvector = outer_1[ 0 ]
for i in range ( 1 , len (s)):
rvector = rvector.union(outer_1[i])
outer = []
for i in range ( len (outer_1)):
inner = []
for w in rvector:
if w in outer_1[i]:
inner.append( 1 )
else :
inner.append( 0 )
outer.append(inner)
comparison = input ( "Input: " )
check = (word_tokenize(comparison))
check = [ps.stem(check[i]).lower() for i in range ( len (check))]
check1 = []
for w in rvector:
if w in check:
check1.append( 1 ) # create a vector
else :
check1.append( 0 )
ds = []
for j in range ( len (outer)):
similarity_index = 0
c = 0
if check1 = = outer[j]:
ds.append( 0 )
else :
for i in range ( len (rvector)):
c + = check1[i] * outer[j][i]
similarity_index + = c
ds.append(similarity_index)
ds maximum = max (ds)
print ()
print ()
print ( "Similar sentences: " )
for i in range ( len (ds)):
if ds[i] = = maximum:
print (a[i])
|
Output: