In this article, we are going to write a Python script to extract author information from a GeeksforGeeks article.
Modules needed
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install it, type the command below in the terminal.
pip install bs4
- requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built-in with Python. To install it, type the command below in the terminal.
pip install requests
Approach:
- Import the modules
- Define a getdata() function that sends a GET request to a URL
- Set the article name
- Build the article URL and pass it into getdata()
- Scrape the data with the help of requests and Beautiful Soup
- Find the required details and filter them
Stepwise execution of the script:
Step 1: Import all dependencies
Python
# import modules
import requests
from bs4 import BeautifulSoup
Step 2: Create a URL get function
Python3
# make a GET request and return the HTML of the page
def getdata(url):
    r = requests.get(url)
    return r.text
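A slightly more defensive variant of getdata() may be useful when the request fails or hangs. The sketch below is only an illustration: the `timeout` parameter and the injectable `get` callable are assumptions that are not part of the original function, while `raise_for_status()` is the standard requests call for turning 4xx/5xx responses into exceptions.

```python
import requests


def getdata(url, timeout=10, get=requests.get):
    """Fetch a page and return its HTML text.

    `timeout` and the injectable `get` callable are assumptions added for
    illustration; the original function takes only `url`.
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # returning the body of an error page.
    response.raise_for_status()
    return response.text
```

Calling `getdata(url)` behaves as before for successful requests; the extra parameters only matter when a different timeout is wanted or when the network call is replaced with a stub.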
Step 3: Now append the article name to the base URL, pass the URL into the getdata() function, and parse the returned HTML with Beautiful Soup.
Python3
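# article name to look up
article = "optparse-module-in-python"

# url (assumed: articles are served directly under https://www.geeksforgeeks.org/)
url = "https://www.geeksforgeeks.org/" + article

# pass the url into the getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# display the html code
print(soup)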
Output: the full HTML of the article page is printed.
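Building the URL from the article name can also be done with the standard library, which takes care of stray slashes and unsafe characters. This is a minimal sketch: `BASE_URL` and the helper name `build_article_url` are hypothetical, and it assumes articles are served directly under https://www.geeksforgeeks.org/.

```python
from urllib.parse import quote, urljoin

# Assumed base URL of the site (not given explicitly in the steps above).
BASE_URL = "https://www.geeksforgeeks.org/"


def build_article_url(slug, base=BASE_URL):
    """Return the article URL for a name such as 'optparse-module-in-python'."""
    # strip("/") tolerates names pasted with leading or trailing slashes;
    # quote() percent-encodes characters that are unsafe in a URL path.
    return urljoin(base, quote(slug.strip("/")))


print(build_article_url("optparse-module-in-python"))
# https://www.geeksforgeeks.org/optparse-module-in-python
```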
Step 4: Extract the author’s handle from the HTML document.
Python
# extract the author's handle
for i in soup.find('div', class_="author_handle"):
    Author = i.get_text()
print(Author)
Output:
kumar_satyam
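Note that `find()` returns `None` when no matching element exists, for example if the page layout changes, and looping over `None` raises a `TypeError`. The sketch below shows one way to guard against that. It runs on a small hypothetical HTML snippet rather than the live page, and the helper name `extract_author_handle` is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, reduced to the one element the script looks for.
SAMPLE_HTML = '<div class="author_handle"><a href="#">kumar_satyam</a></div>'


def extract_author_handle(html):
    """Return the author handle, or None if the element is not in the page."""
    soup = BeautifulSoup(html, "html.parser")
    handle_div = soup.find("div", class_="author_handle")
    if handle_div is None:
        # find() returns None when nothing matches.
        return None
    return handle_div.get_text(strip=True)


print(extract_author_handle(SAMPLE_HTML))         # kumar_satyam
print(extract_author_handle("<p>no author</p>"))  # None
```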
Step 5: Now create the profile URL from the author’s handle and get its HTML code.
Python3
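# now get the author's information using the author's handle
# profile url (assumed pattern for GeeksforGeeks profile pages)
profile = "https://auth.geeksforgeeks.org/user/" + Author + "/articles"

# pass the url into the getdata function
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')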
Step 6: Extract the author’s information.
Python3
# extract the author's name and details
name = soup.find(
    'div',
    class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText').get_text()

author_info = []
for item in soup.find_all(
        'div',
        class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())

print("Author name :")
print(name)
print("Author information :")
print(author_info)
Output:
Author name : Satyam Kumar
Author information :
['LNMI patna', '\nhttps://www.linkedin.com/in/satyam-kumar-174273101/']
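The output above still carries a leading newline in the second entry, because `get_text()` returns the text exactly as it appears in the markup. A small post-processing step can tidy this up. The helper name `clean_author_info` is hypothetical, and the input is simply the list shown above.

```python
def clean_author_info(items):
    """Strip surrounding whitespace from each entry and drop empty ones."""
    cleaned = []
    for item in items:
        text = item.strip()
        if text:
            cleaned.append(text)
    return cleaned


raw_info = ["LNMI patna", "\nhttps://www.linkedin.com/in/satyam-kumar-174273101/"]
print(clean_author_info(raw_info))
# ['LNMI patna', 'https://www.linkedin.com/in/satyam-kumar-174273101/']
```

Passing `strip=True` to `get_text()` when collecting the entries would have a similar effect at extraction time.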
Complete code:
Python3
# import modules
import requests
from bs4 import BeautifulSoup


# make a GET request and return the HTML of the page
def getdata(url):
    r = requests.get(url)
    return r.text


# article name to look up
article = "optparse-module-in-python"

# url (assumed: articles are served directly under https://www.geeksforgeeks.org/)
url = "https://www.geeksforgeeks.org/" + article

# pass the url into the getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# extract the author's handle
for i in soup.find('div', class_="author_handle"):
    Author = i.get_text()

# now get the author's information using the author's handle
# profile url (assumed pattern for GeeksforGeeks profile pages)
profile = "https://auth.geeksforgeeks.org/user/" + Author + "/articles"

# pass the url into the getdata function
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')

# extract the author's name and details
name = soup.find(
    'div',
    class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText').get_text()

author_info = []
for item in soup.find_all(
        'div',
        class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())

print("Author name :", name)
print("Author information :")
print(author_info)
Output:
Author name : Satyam Kumar
Author information :
['LNMI patna', '\nhttps://www.linkedin.com/in/satyam-kumar-174273101/']
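One detail of the script worth illustrating: when `class_` is given a string containing several class names, Beautiful Soup matches only elements whose class attribute is exactly that string, in that order. That is why the search ending in `textBold` does not also return the name element ending in `textBold medText`. The sketch below demonstrates this on hypothetical markup (the third `div` is invented to show the effect of reordered classes) and compares it with a CSS selector, which ignores class order.

```python
from bs4 import BeautifulSoup

# Hypothetical profile markup reusing the class names from the script.
SAMPLE_HTML = """
<div class="mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText">Satyam Kumar</div>
<div class="mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold">LNMI patna</div>
<div class="textBold mdl-cell mdl-cell--9-col mdl-cell--12-col-phone">reordered classes</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact string match on the whole class attribute: order and extra classes matter.
exact = soup.find_all(
    "div", class_="mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold")

# CSS selector: matches on individual classes regardless of their order.
by_selector = soup.select("div.mdl-cell.textBold:not(.medText)")

print([d.get_text() for d in exact])        # ['LNMI patna']
print([d.get_text() for d in by_selector])  # ['LNMI patna', 'reordered classes']
```

The exact-string form is stricter and breaks if the site reorders its classes; the selector form is more tolerant but may match more elements than intended.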