Project Idea – Searching news from Old Newspaper using NLP

Newspapers are a rich source of knowledge. When people need information about a particular topic, they usually search online, but it is difficult to find old articles from regional and local newspapers this way, because not every local newspaper offers an online search. In this article, we present an idea to overcome this problem.

What does the project do?

Why NLP?

Newspaper text contains many articles (a, an, the), prepositions, and other stop words that are not useful for searching, so we use NLP to remove them. It also helps us reduce the text to its unique words.



Technologies used:

Tools used:

Libraries used:

Use Case Diagram

Step-by-step implementation:

Libraries installation

First, install the required libraries on Colab.






!pip install nltk
!pip install pytesseract
  
!sudo apt install tesseract-ocr
  
# to check if it installed properly
# !which tesseract
# pytesseract.pytesseract.tesseract_cmd = (
#     r'/usr/bin/tesseract'
# )

Let’s import all the necessary libraries:




import io
import glob
import os
import cv2
import pytesseract
# /usr/bin/tesseract
import pandas as pd
import nltk
nltk.download('popular')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
# Image from IPython is used to display the result image in the notebook.
# (PIL's Image is not imported, since this name would shadow it.)
from IPython.display import Image
from google.colab.patches import cv2_imshow

pre function

This function cleans the text so that only important names and keywords remain. It removes stop words and duplicate words.




def pre(text):
    # Lowercase the text and split it into alphanumeric tokens
    text = text.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    new_words = tokenizer.tokenize(text)

    # Remove English stop words (a set makes the lookup fast)
    stop_words = set(stopwords.words("english"))
    filtered_words = [w for w in new_words if w not in stop_words]

    # Keep only the first occurrence of each word, preserving order
    unique = list(dict.fromkeys(filtered_words))

    return ' '.join(unique)
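To see what this cleaning step produces without downloading the NLTK corpora, here is a minimal stand-in that applies the same steps (lowercase, tokenize on `\w+`, drop stop words, drop duplicates). The tiny `STOP_WORDS` set and the name `pre_sketch` are assumptions made for illustration only; the real function uses NLTK's full English stop word list.

```python
import re

# Hypothetical, tiny stop word set used only for this illustration;
# pre() uses NLTK's full English list instead.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "and", "is", "to"}

def pre_sketch(text):
    # Lowercase and split into alphanumeric tokens, like RegexpTokenizer(r'\w+').
    tokens = re.findall(r"\w+", text.lower())
    # Drop stop words.
    filtered = [w for w in tokens if w not in STOP_WORDS]
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return " ".join(dict.fromkeys(filtered))

print(pre_sketch("The Minister visited the city, and the city welcomed the Minister."))
# prints: minister visited city welcomed
```

Because duplicates are dropped, the result records only whether a word occurs in the page, not how often or where.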

to_df function 

Given an image path as a parameter, this function runs OCR on the image and stores the extracted text in the text variable. The text is then passed to pre() for cleaning. The function returns a dictionary with the filename and the important text.




def to_df(imgno):
    # Run Tesseract OCR on the image file, then clean the extracted text
    text = pytesseract.image_to_string(imgno)
    out = pre(text)
    data = {'filename': imgno,
            'text': out}
    return data

Driver code

Here we define the dataframe that stores the dictionaries, each holding an image path and the text found in that image. We will use this dataframe for searching.




i = 0
dff = pd.DataFrame()

Listing all images in the content folder.




images = []
folder = "/content/"
  
for filename in os.listdir(folder):
    img = cv2.imread(os.path.join(folder, filename))
      
    if img is not None:
        print(filename)
        images.append(filename)
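Decoding every file with `cv2.imread` just to check whether it is an image can be slow for a large folder. A lighter sketch, assuming the scans use common image extensions, filters by file suffix instead. The helper name `list_images` and the extension set are illustrative assumptions, not part of the original project.

```python
from pathlib import Path

# Assumed set of extensions for scanned newspaper pages.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}

def list_images(folder):
    """Return sorted file names in `folder` whose suffix looks like an image."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

With this helper, `images = list_images("/content/")` would stand in for the loop above. The trade-off is that a file with an image extension but corrupt contents is no longer filtered out at this stage.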

getting all images 

Loop over the listed images, run OCR on each one, and add the result to the dataframe.




for u in images:
    i += 1
    data = to_df(u)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    dff = pd.concat([dff, pd.DataFrame(data, index=[i])])

print(dff)

dataframe 

Processing the images




# sample text output after processing image
dff.iloc[0]['text']

Sample text after preprocessing

Saving the dataframe

The dataframe is saved to a CSV file, which serves as a simple database for the search step.




# saving the dataframe
dff.to_csv('save newsdf.csv')

saved Dataframe 

Searching

Open the dataframe file from storage.




data = pd.read_csv('/content/save newsdf.csv')
data

open dataframe from storage

We provide a keyword string as input and look for the images whose text contains that keyword.




txt = 'modi'
index = data['text'].str.find(txt)
index
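`str.find` returns the character position of the first match, or -1 when the keyword is absent. If only a yes/no answer per row is needed, a boolean mask is a possible alternative. The sketch below uses a small made-up dataframe (the file names and text are invented for illustration) and lowercases the query because `pre` stores all text in lowercase. The helper name `search` is hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the saved OCR output.
data = pd.DataFrame({
    "filename": ["page1.jpg", "page2.jpg", "page3.jpg"],
    "text": ["election results announced", "modi visits city", None],
})

def search(df, keyword):
    # regex=False treats the keyword literally; na=False skips rows with no text.
    mask = df["text"].str.contains(keyword.lower(), regex=False, na=False)
    return df.loc[mask, "filename"].tolist()

print(search(data, "Modi"))
# prints: ['page2.jpg']
```

Like `str.find`, this is a substring match, so a query such as `'mod'` would also match rows containing `'modi'`.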

Rows with a value other than -1 correspond to images that contain the word ‘modi’.

Showing the result




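# Collect the row numbers whose text contains the keyword
from IPython.display import display

a = []
for i in range(len(index)):
    if index[i] != -1:
        a.append(i)

# We are showing only the first result here
if a:
    res = data.iloc[a[0]]['filename']
    display(Image(res))
else:
    print("no file")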

Result of the project

We searched for the word ‘modi’. The first newspaper image that contains the searched word is shown here.

Scope for Improvement

We could use a dedicated search engine, such as Apache Lucene or Elasticsearch, to make the search faster and more efficient. For the time being, we use the pandas library to find the path of the image to display to the user.
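As a rough illustration of why a dedicated search engine is faster: engines such as Lucene are built around an inverted index, which maps each word to the documents containing it, so a lookup does not scan every row. The following is a minimal pure-Python sketch of that idea, not how either product is actually set up; the function names and sample data are hypothetical.

```python
from collections import defaultdict

def build_index(records):
    """Map each word to the set of file names whose text contains it."""
    index = defaultdict(set)
    for filename, text in records:
        for word in text.split():
            index[word].add(filename)
    return index

def lookup(index, keyword):
    # One dictionary lookup instead of scanning every row.
    return sorted(index.get(keyword.lower(), set()))

# Invented (filename, cleaned text) pairs standing in for the dataframe rows.
records = [
    ("page1.jpg", "election results announced"),
    ("page2.jpg", "modi visits city"),
    ("page3.jpg", "modi election rally"),
]
index = build_index(records)
print(lookup(index, "modi"))      # prints: ['page2.jpg', 'page3.jpg']
print(lookup(index, "election"))  # prints: ['page1.jpg', 'page3.jpg']
```

Unlike the substring search used in this project, this sketch matches whole words only, and it returns every matching page rather than just the first.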

Project Application in Real-Life

