
Extract Author’s information from a GeeksforGeeks article using Python

Last Updated : 25 Aug, 2021

In this article, we will write a Python script that extracts author information from a GeeksforGeeks article.

Modules needed

  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It is not part of the Python standard library. To install it, run the following command in the terminal:
pip install bs4
  • requests: Requests lets you send HTTP/1.1 requests with very little code. It is also not part of the standard library. To install it, run the following command in the terminal:
pip install requests
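
A quick way to confirm that both installs worked is to import the packages and print their versions. This is only a small sanity check and not part of the script itself.

```python
import bs4
import requests

# both packages expose a version string
print("beautifulsoup4:", bs4.__version__)
print("requests:", requests.__version__)
```

If either import fails with ModuleNotFoundError, the corresponding pip command has to be run again in the same environment that runs the script.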

Approach:

  • Import the modules
  • Define a getdata() function that sends a GET request to a URL
  • Set the article name
  • Build the article URL and pass it to getdata()
  • Parse the returned HTML with Beautiful Soup
  • Find the required details and filter them

Stepwise execution of scripts:

Step 1: Import all dependencies

Python




# import module
import requests
from bs4 import BeautifulSoup


 
Step 2: Create a function that fetches a URL

Python3




# fetch the HTML of a page
# by making a GET request
def getdata(url):
    r = requests.get(url)
    return r.text
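
A request can hang or come back with an error page, so a slightly more defensive variant of getdata() may be worth considering. The sketch below is an illustration under assumptions, not part of the original script: the name getdata_safe, the 10-second timeout, and the User-Agent string are all hypothetical choices.

```python
import requests


def getdata_safe(url, timeout=10):
    """Fetch a page and return its HTML, raising on HTTP errors.

    Hypothetical variant of getdata(); the timeout value and the
    User-Agent header below are illustrative assumptions.
    """
    headers = {"User-Agent": "Mozilla/5.0 (author-info-example)"}
    r = requests.get(url, headers=headers, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses instead of
    # silently returning the body of an error page
    r.raise_for_status()
    return r.text
```

With this variant, a missing article surfaces as an exception at the fetch step rather than as a confusing parsing failure later on.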


Step 3: Now append the article name to the URL, pass the URL to the getdata() function, and parse the returned HTML with Beautiful Soup

Python3




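# input article by geek
article = "optparse-module-in-python"

# url (base address assumed to be the main GeeksforGeeks site)
url = "https://www.geeksforgeeks.org/" + article

# pass the url
# into getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# display html code
print(soup)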
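
To see what the soup object offers without fetching a live page, the same parsing call can be tried on a small inline HTML string. The markup below is invented for illustration and is not the real GeeksforGeeks page.

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for the downloaded article page
sample_html = """
<html>
  <head><title>Optparse module in Python</title></head>
  <body><h1>Optparse module in Python</h1></body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# the parsed tree can be navigated by tag name
page_title = soup.title.get_text()
heading = soup.find("h1").get_text()
print(page_title)
print(heading)
```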


Output: the parsed HTML of the article page is printed.

Step 4: Find the author’s name in the HTML document.

Python




# find the div holding the author name and read its text
Author = soup.find('div', class_="author_handle").get_text(strip=True)
print(Author)
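
soup.find() returns None when no matching element exists, which would make the lookup above fail with an AttributeError if the page layout changes. A minimal sketch of a guarded lookup follows; the helper name extract_author_handle and the sample markup are hypothetical and only mimic the structure the code above expects.

```python
from bs4 import BeautifulSoup

# invented snippet that mimics the structure the article's code expects
sample_html = '<div class="author_handle"><a href="#">kumar_satyam</a></div>'


def extract_author_handle(html):
    """Return the text of the first div.author_handle, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", class_="author_handle")
    if tag is None:
        # the page layout changed or the element is missing
        return None
    return tag.get_text(strip=True)


print(extract_author_handle(sample_html))
print(extract_author_handle("<p>no author here</p>"))
```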


Output: 

kumar_satyam

Step 5: Now build the profile URL from the author name and fetch its HTML.

Python3




# now get author information
# with author name
profile = 'https://auth.geeksforgeeks.org/user/' + Author + '/profile'

# pass the url
# into getdata function
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')
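
Scraped text can carry stray whitespace or characters that are not valid in a URL path, so it may help to clean the author name before concatenating it. The sketch below assumes the same profile URL pattern used above; the helper name build_profile_url is hypothetical.

```python
from urllib.parse import quote

# base taken from the profile URL used in this article
PROFILE_BASE = "https://auth.geeksforgeeks.org/user/"


def build_profile_url(handle):
    """Build the profile URL for an author handle, escaping unsafe characters."""
    # strip stray whitespace, then percent-encode anything not URL-safe
    safe_handle = quote(handle.strip(), safe="")
    return f"{PROFILE_BASE}{safe_handle}/profile"


print(build_profile_url(" kumar_satyam\n"))
```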


Step 6: Extract the author’s information from the profile page.

Python3




# extract the author's display name
name = soup.find(
    'div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText').get_text()

# collect the remaining profile details
author_info = []
for item in soup.find_all('div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())

print("Author name :", name)
print("Author information  :")
print(author_info)
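
The two lookups above return different elements because Beautiful Soup treats a class_ value containing spaces as the exact string of the class attribute, so the order and the full set of classes must match. The sketch below demonstrates that behaviour on invented markup with shortened class lists, and contrasts it with a CSS selector via select(), which matches on class membership instead.

```python
from bs4 import BeautifulSoup

# invented markup reusing a few of the class names from the selectors above
sample_html = """
<div class="mdl-cell textBold medText">Satyam Kumar</div>
<div class="mdl-cell textBold">LNMI patna</div>
<div class="textBold mdl-cell">
https://example.com/profile</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# exact-string match: only the div whose class attribute is exactly
# "mdl-cell textBold" (same classes, same order) is returned
exact = [d.get_text(strip=True)
         for d in soup.find_all("div", class_="mdl-cell textBold")]

# CSS selector: matches every div carrying both classes, in any order,
# including divs that have extra classes such as medText
by_selector = [d.get_text(strip=True)
               for d in soup.select("div.mdl-cell.textBold")]

print(exact)
print(by_selector)
```

The exact-string form is what keeps the name div (which also has medText) out of author_info in the script, at the cost of breaking if the site reorders its classes.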


Output:

Author name : Satyam Kumar 
Author information  : 
['LNMI patna', '\nhttps://www.linkedin.com/in/satyam-kumar-174273101/']
 

Complete code:

Python3




# import module
import requests
from bs4 import BeautifulSoup
 
# fetch the HTML of a page
# by making a GET request


def getdata(url):
    r = requests.get(url)
    return r.text
 
 
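# input article by geek
article = "optparse-module-in-python"

# url (base address assumed to be the main GeeksforGeeks site)
url = "https://www.geeksforgeeks.org/" + article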
 
# pass the url
# into getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')
 
# find the div holding the author name and read its text
Author = soup.find('div', class_="author_handle").get_text(strip=True)
 
# now get author information
# with author name
profile = 'https://auth.geeksforgeeks.org/user/'+Author+'/profile'
 
# pass the url
# into getdata function
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')
 
# traverse information of author
name = soup.find(
    'div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText').get_text()
 
 
author_info = []
for item in soup.find_all('div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())
 
print("Author name :", name)
print("Author information  :")
print(author_info)


Output:

Author name : Satyam Kumar 
Author information  : 
['LNMI patna', '\nhttps://www.linkedin.com/in/satyam-kumar-174273101/']
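
The printed values still carry stray whitespace, such as the trailing space after the name and the leading newline before the LinkedIn URL. A minimal clean-up sketch follows, using the values from the output above as literals; the dictionary layout and its key names are hypothetical.

```python
# values as printed in the output above
name = "Satyam Kumar "
author_info = ["LNMI patna",
               "\nhttps://www.linkedin.com/in/satyam-kumar-174273101/"]

# strip leading/trailing whitespace and drop entries that end up empty
cleaned_info = [item.strip() for item in author_info if item.strip()]

# hypothetical record layout; the key names are illustrative
author_record = {"name": name.strip(), "details": cleaned_info}
print(author_record)
```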
 


