The task is to count the most frequent words on a web page, extracting the data from a live source.
First, build a small web scraper with the requests module and the beautifulsoup4 module, which fetches the page, extracts its text, and stores the words in a list. Some undesired words and symbols (special characters, blank spaces) may be present; filtering them out simplifies the counting and gives cleaner results.
After counting each word, we can also report the most frequent ones (say, the top 10 or 20).
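For instance, the token 'language,' should be tallied together with 'language'. Here is a minimal sketch of that cleaning idea using the standard string module (the full implementation below instead removes a hand-picked set of symbols):
Python3
import string

# Strip leading/trailing punctuation so 'language,' and 'language'
# are counted as the same word.
token = 'language,'
print(token.strip(string.punctuation))  # language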
Modules and library functions used:
requests : Lets you send HTTP/1.1 requests with a simple API.
beautifulsoup4 : Parses HTML and XML documents so data can be extracted from them.
collections : Implements high-performance container datatypes; Counter is the one used here (see the short example after this list).
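A quick sketch of the counting primitive used below: Counter tallies any iterable, and most_common(n) returns the n highest-count (item, count) pairs in descending order.
Python3
from collections import Counter

# Tally a small sample list of words and report the top two.
words = "the cat sat on the mat near the cat".split()
print(Counter(words).most_common(2))
# [('the', 3), ('cat', 2)]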
Below is an implementation of the idea discussed above:
Python3
import requests
from bs4 import BeautifulSoup
from collections import Counter


def start(url):
    # Fetch the page and collect every word found inside the
    # 'entry-content' divs, lower-cased.
    wordlist = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')

    for each_text in soup.find_all('div', {'class': 'entry-content'}):
        content = each_text.text
        words = content.lower().split()
        for each_word in words:
            wordlist.append(each_word)
    clean_wordlist(wordlist)


def clean_wordlist(wordlist):
    # Remove unwanted symbols from each word and drop any word
    # that becomes empty.
    clean_list = []
    symbols = '!@#$%^&*()_-+={[}]|\\;:"<>?/., '
    for word in wordlist:
        for symbol in symbols:
            word = word.replace(symbol, '')
        if len(word) > 0:
            clean_list.append(word)
    create_dictionary(clean_list)


def create_dictionary(clean_list):
    # Count every word, then print the ten most frequent.
    word_count = {}
    for word in clean_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    c = Counter(word_count)
    top = c.most_common(10)
    print(top)


if __name__ == '__main__':
    # Example URL; substitute the page you want to analyse.
    url = 'https://www.geeksforgeeks.org/python-programming-language/'
    start(url)
Output:
[('to', 10), ('in', 7), ('is', 6), ('language', 6), ('the', 5),
('programming', 5), ('a', 5), ('c', 5), ('you', 5), ('of', 4)]
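As the output shows, function words such as 'to', 'in', and 'the' dominate the top of the list. An optional refinement, not part of the original code, is to drop such stopwords before counting; the stopword set below is only an illustrative sample. Note also that Counter can count the cleaned list directly, which would make the manual dictionary in create_dictionary unnecessary.
Python3
from collections import Counter

# Illustrative stopword sample; extend it for real use.
STOPWORDS = {'to', 'in', 'is', 'the', 'a', 'of', 'and', 'you'}

def top_content_words(clean_list, n=10):
    # Counter tallies the filtered words directly, so no manual
    # dictionary is needed before calling most_common().
    filtered = (w for w in clean_list if w not in STOPWORDS)
    return Counter(filtered).most_common(n)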