Web Scraping Coronavirus Data into MS Excel

Prerequisites: Web Scraping using BeautifulSoup

Coronavirus cases are increasing rapidly worldwide. This article will guide you through web scraping Coronavirus data and exporting it into MS Excel.

What is Web Scraping?

If you’ve ever copied and pasted information from a website, you’ve performed the same function as any web scraper, only on a microscopic, manual scale. Web scraping, also known as online data mining, is the method of extracting or scraping data from a website. The gathered data is then converted into a format that is more accessible to the user, such as a spreadsheet or an API.

Approach: 

  1. Request a response from the webpage.
  2. Parse and extract with the help of the BeautifulSoup() class method and lxml module.
  3. Download and export the data with pandas into Excel.

The Data Source:



We need a webpage to fetch the coronavirus data. So we will be using the Worldometer website here. Worldometer’s webpage will look something like this:

[Screenshot: Worldometer data source]

Programmatic Implementation

There are a few libraries you will need, so first, you need to install them.

Go to your command line and run:

pip install requests
pip install lxml
pip install bs4
pip install pandas

Now let’s see what we can do with these libraries.

Below are the steps for Web Scraping Coronavirus Data into Excel:

Step 1) Use the requests library to grab the page.

Python3

# Import required module
import requests

# Make requests from webpage
url = 'https://www.worldometers.info/coronavirus/'
result = requests.get(url)

The requests library that we installed fetches a response from the webpage: we call the requests.get(url) method and store the result. If the request is successful, the complete webpage source code is available through result.text as one giant Python string, but that raw HTML is not structured.



Note: This may fail if you have a firewall blocking Python/Jupyter. Sometimes you need to run this twice if it fails the first time.
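
If you want to confirm the request actually worked before moving on, you can check the response explicitly (a minimal sketch; raise_for_status() and status_code are standard parts of the requests API):

Python3

# Verify the request succeeded before parsing
result.raise_for_status()       # raises an HTTPError for 4xx/5xx responses
print(result.status_code)      # 200 means the page was fetched successfully
print(result.text[:200])       # preview the raw, unstructured HTML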

Step 2) Use the BeautifulSoup() method to extract data from websites. 

The bs4 library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file). It is a Python library for pulling data out of HTML and XML files. Using the BeautifulSoup() method of the bs4 module, we can create a soup object that contains all the ingredients of the webpage.

Python3


# Import required modules
import bs4

# Creating soup object
soup = bs4.BeautifulSoup(result.text, 'lxml')


We import bs4 to create a BeautifulSoup object, and we pass two things to the constructor: the result.text string and 'lxml' as the parser. lxml goes through this HTML document and figures out the different CSS classes, ids, HTML elements, tags, and so on.
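
Once the soup object exists, you can already query the parsed tree to confirm the parse worked (a quick sketch using standard BeautifulSoup accessors):

Python3

# Quick sanity checks on the parsed document
print(soup.title.string)            # text of the page's <title> tag
print(len(soup.find_all('div')))    # number of <div> elements parsed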

To extract the data, we first need to find the right element: right-click on the number of cases and hit Inspect. Refer to the attached snapshot below.

[Screenshot: inspecting the website]

We need to find the right class, i.e., class_='maincounter-number' serves our purpose. Refer to the attached snapshot below.

[Screenshot: finding the right class]

The BeautifulSoup object has been created in our Python script and the HTML data of the website has been scraped off of the page. Next, we need to get the data that we are interested in, out of the HTML code.

Python3


# Searching div tags having maincounter-number class
cases = soup.find_all('div', class_='maincounter-number')


[Input screenshot: inspect element]

There is still a lot of HTML code that we do not want. Our desired data entries are wrapped in HTML div elements with class_='maincounter-number'. We can use this knowledge to further clean up the scraped data.
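
To see what find_all() actually returned, you can inspect the list before cleaning it (a sketch; the exact markup depends on Worldometer's current page layout):

Python3

# Each entry is a full <div> tag; the number is still inside a child <span>
print(len(cases))    # expect 3 entries: total cases, deaths, recovered
print(cases[0])      # e.g. <div class="maincounter-number"><span>...</span></div>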

Step 3) Storing the data

We need to save the scraped data in some form that can be used effectively. For this project, all the data will be saved in a Python list.

Python3


# List to store number of cases
data = []
  
# Find the span and get data from it
for i in cases:
    span = i.find('span')
    data.append(span.string)
  
# Display number of cases 
print(data)


[Input screenshot: inspect element]

Output: [screenshot of the printed list]

We use the span inside each div to fetch the data. We need only the number of cases, not the surrounding tags, so we use span.string to get those numbers, which are then stored in the data list.

Now that we have the number of cases, we are ready to export our data into an Excel file.
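
Note that span.string returns text, so the values in data are comma-separated strings such as '1,234,567'. If you would rather work with actual numbers, a small optional cleanup step converts them (plain str.replace() and int(), nothing library-specific):

Python3

# Optional: convert comma-separated strings to integers
numbers = [int(value.replace(',', '').strip()) for value in data]
print(numbers)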



Step 4) Processing the data

Our last step is to export the data to MS Excel, for which we are going to use the pandas module. To load the pandas module and start working with it, import the package. 

Python3


import pandas as pd

# Creating dataframe
df = pd.DataFrame({"CoronaData": data})

# Labelling the rows (index)
df.index = ['TotalCases', 'Deaths', 'Recovered']


A DataFrame is a 2-D, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

df = pd.DataFrame({"CoronaData": data}) creates a DataFrame with a single column named CoronaData and maps it to the data list that we created earlier.

Next, we label the rows with df.index; note that df.index names the rows of the DataFrame, not the columns.
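
If you prefer, the same labelled table can be built in a single call by passing the index to the constructor directly (an equivalent sketch using the standard pandas API):

Python3

# Equivalent one-step construction with labelled rows
df = pd.DataFrame({'CoronaData': data},
                  index=['TotalCases', 'Deaths', 'Recovered'])
print(df)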

Output: [screenshot of the DataFrame]

Step 5) Exporting data into Excel

We are ready to export the data into Excel. We will use the df.to_csv() method for this task; it writes a CSV file, which Excel opens directly.



Python3


# Exporting data into Excel
df.to_csv('Corona_Data.csv')


Output: [screenshot of the exported file]
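
If you want a native .xlsx workbook instead of a CSV, pandas also provides df.to_excel(); note this assumes the openpyxl package is installed, since pandas relies on it to write .xlsx files:

Python3

# Optional: write a true Excel workbook instead of a CSV
df.to_excel('Corona_Data.xlsx')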

Below is the complete program from the above steps:

Python3


# Import required modules
import requests
import bs4
import pandas as pd

# Make requests from webpage
url = 'https://www.worldometers.info/coronavirus/'
result = requests.get(url)

# Creating soup object
soup = bs4.BeautifulSoup(result.text, 'lxml')

# Searching div tags having maincounter-number class
cases = soup.find_all('div', class_='maincounter-number')

# List to store number of cases
data = []

# Find the span and get data from it
for i in cases:
    span = i.find('span')
    data.append(span.string)

# Display number of cases
print(data)

# Creating dataframe
df = pd.DataFrame({"CoronaData": data})

# Labelling the rows (index)
df.index = ['TotalCases', 'Deaths', 'Recovered']

# Exporting data into Excel
df.to_csv('Corona_Data.csv')


Final Result: [screenshot]

