Web Scraping R Data From JSON

Last Updated : 05 Feb, 2023

Many websites provide their data in JSON format, which we can use for analysis in R. JSON (JavaScript Object Notation) is a text-based format for representing structured data, based on JavaScript object syntax. In this article, we will see how to scrape data from a JSON web source and use it in the R programming language.

Web Scraping Data from JSON

We are going to scrape this JSON data about COVID cases available here, which looks something like this:

Raw JSON Data

To get this data into R, we need a package called jsonlite. It does not come pre-installed with R, so install it first and then load it in the script.

R




# installing the library
install.packages("jsonlite")
  
# include it
library(jsonlite)


Now that we have loaded the library, we can use the fromJSON function to parse the data. Pass the URL of the COVID data given above to the function; the result, rawdata, is a large list containing the information.

R




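# scraping the JSON data to R (replace the placeholder with the COVID JSON URL mentioned above)
rawdata <- fromJSON("<URL of the COVID JSON data>")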


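The parsing step can also be tried offline, without depending on the live URL. The following is a minimal sketch that passes a small inline JSON string to fromJSON. The keys (DL, KA, dates, total, confirmed) and the numbers are made up for illustration and only loosely imitate the kind of structure described above; the real feed may be shaped differently.

```r
library(jsonlite)

# Hypothetical miniature JSON; keys and values are illustrative only
json_text <- '{
  "DL": {"dates": {"2020-03-15": {"total": {"confirmed": 7}},
                   "2020-03-16": {"total": {"confirmed": 8}}}},
  "KA": {"dates": {"2020-03-16": {"total": {"confirmed": 6}}}}
}'

# fromJSON accepts a JSON string as well as a URL or a file path
demo_raw <- fromJSON(json_text)

# JSON objects become named lists, so the top-level names are the state codes
names(demo_raw)
str(demo_raw, max.level = 2)
```

Printing the names and a shallow str() first is usually easier to read than printing the whole list.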
Accessing the data

Let us now try to access some information. If we simply print rawdata, we get a long list containing the COVID case data for all states and union territories in India:

 

If we want the data only for Delhi, whose code is DL, we can look at the JSON structure on the webpage and try this:

R




# getting the information for DL (Delhi)
data <- rawdata['DL']


which outputs all the information about Delhi:

Data
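Note that single brackets return a list that still wraps the selected element, which is why a further [[1]] is needed before going deeper. The sketch below shows the difference on a small stand-in list; the names in demo_raw are hypothetical and chosen only for illustration.

```r
# Hypothetical stand-in for the parsed list
demo_raw <- list(DL = list(dates = list()), KA = list(dates = list()))

# Single brackets keep the wrapper: a list of length 1 named "DL"
wrapped <- demo_raw["DL"]

# Double brackets extract the element itself
inner <- demo_raw[["DL"]]

names(wrapped)  # "DL"
names(inner)    # "dates"
```

Using rawdata[["DL"]] directly would therefore save one level of indexing in later steps.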

If we want information about a specific date, for example 2020-03-16, we can index into the nested list by position, following the structure visible in the JSON:

R




# data for specific date
data[[1]][[1]][[15]]


Output:

Data for DL on 16 March 2020
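Positional indexes such as [[15]] depend on the order of entries in the source, so accessing elements by name can be more robust. The following is a sketch on a hypothetical miniature JSON string; the key names (dates, total, confirmed) are assumptions for illustration and should be checked against the actual data.

```r
library(jsonlite)

# Hypothetical miniature JSON; the real key names may differ
demo_raw <- fromJSON('{"DL": {"dates": {
  "2020-03-15": {"total": {"confirmed": 7}},
  "2020-03-16": {"total": {"confirmed": 8}}}}}')

# Access by name instead of by position
by_name <- demo_raw[["DL"]][["dates"]][["2020-03-16"]]

# Equivalent positional access for this toy input (second date entry)
by_position <- demo_raw[[1]][[1]][[2]]

identical(by_name, by_position)
```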

You can go ahead and perform various actions on this list, such as analyzing the data or plotting graphs. See this tutorial on R lists to try such actions.
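As one possible starting point for such analysis, a nested list can be flattened into a data frame. The sketch below uses a small hand-written list, dl_dates, as a stand-in for one state's data; its shape and the field names total and confirmed are assumptions, not the confirmed structure of the real feed.

```r
# Hypothetical stand-in for one state's parsed data
dl_dates <- list(
  "2020-03-14" = list(total = list(confirmed = 7)),
  "2020-03-15" = list(total = list(confirmed = 7)),
  "2020-03-16" = list(total = list(confirmed = 8))
)

# Pull one number out of each date entry; vapply checks the result type
confirmed <- vapply(dl_dates, function(day) day$total$confirmed, numeric(1))

# Assemble a data frame that is easy to summarise or plot
cases <- data.frame(
  date = as.Date(names(confirmed)),
  confirmed = unname(confirmed)
)

cases
# plot(cases$date, cases$confirmed, type = "l")  # uncomment to draw a line chart
```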

Web Scraping using XPath

To scrape data directly from an HTML element, we can use XPath. The XPath of an element can be found with the browser's Inspect tool. In the Chrome browser:

right-click > Inspect > right-click on the element > Copy > Copy full XPath.

First, we need to install the rvest package, a library to scrape web pages.

install.packages('rvest')

Suppose we are interested in scraping the timetable for train no. 14553 on trainman.in, which is at this URL:

https://www.trainman.in/train/14553

Then, in the Inspect panel, select the first row element of the timetable.

Exemplary website to be scraped for demonstration purpose

Go ahead and copy the XPath as described above. It will look something like this, though it may differ if the page changes:

/html/body/app-root/app-wrapper/div/main/train-schedule/div[2]/div[1]/div/div[3]/table/tbody/tr[1]

Select the Copy full XPath option from the menu

The XPath that we got is for one row only. What about the rest of the rows? For that, remove the index [1] from tr[1]:

/html/body/app-root/app-wrapper/div/main/train-schedule/div[2]/div[1]/div/div[3]/table/tbody/tr
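The effect of dropping the index can be checked on a small inline table before touching a real site. This is a minimal sketch: the HTML below is a hypothetical stand-in for the timetable, and the shortened XPath expressions are written for this toy page only.

```r
library(rvest)

# Hypothetical inline page standing in for the real timetable
page <- read_html(
  "<table><tbody>
     <tr><td>1</td><td>Station A</td></tr>
     <tr><td>2</td><td>Station B</td></tr>
     <tr><td>3</td><td>Station C</td></tr>
   </tbody></table>"
)

# With the index: only the first row matches
first_row <- html_nodes(page, xpath = "//table/tbody/tr[1]")

# Without the index: every row matches
all_rows <- html_nodes(page, xpath = "//table/tbody/tr")

length(first_row)
length(all_rows)
html_text(all_rows, trim = TRUE)
```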

So now it matches not just one row but all the rows in the table. To scrape this in R, store the URL in a variable, then fetch and parse the HTML by calling read_html(url). To select the specific elements, call html_nodes(), passing the page and the XPath. Finally, pipe the result into html_text() with %>% to keep only the text, excluding the tags and their attributes.

R




# include the installed library rvest
library(rvest)
  
# store the url
url <- "https://www.trainman.in/train/14553"
  
# get the data
page <- read_html(url)
  
# filter the required data using xpath
rows <- html_nodes(page, xpath = "/html/body/app-root/app-wrapper/div/main/train-schedule/div[2]/div[1]/div/div[3]/table/tbody/tr") %>% html_text()
  
# print
rows


Output:

Data Scraped from the website

If we had simply copied the XPath of the <table> tag, we would have got only one entry containing all the stations, as opposed to 25 entries.

Raw data from the website
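The same behaviour can be reproduced on a toy page. In this sketch the inline HTML and the shortened XPath expressions are assumptions made for illustration; they are not taken from the real site.

```r
library(rvest)

# Hypothetical two-row table
page <- read_html(
  "<table><tbody>
     <tr><td>Station A</td></tr>
     <tr><td>Station B</td></tr>
   </tbody></table>"
)

# Selecting the table returns a single node, so html_text() gives one string
whole_table <- html_nodes(page, xpath = "//table") %>% html_text()

# Selecting the rows returns one node, and so one string, per row
per_row <- html_nodes(page, xpath = "//table/tbody/tr") %>% html_text()

length(whole_table)
length(per_row)
```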

This method is not limited to table tags; it works for any HTML element, though the details may differ slightly depending on the structure of the webpage.
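As a sketch of the same approach applied to non-table elements, the example below selects a heading and a link from a hypothetical inline page. The markup and the /schedule link are invented for illustration; html_attr() is used here to read an attribute rather than the text.

```r
library(rvest)

# Hypothetical inline page; the elements and the link are illustrative
page <- read_html(
  '<html><body>
     <h1>Train 14553</h1>
     <p>See the <a href="/schedule">full schedule</a>.</p>
   </body></html>'
)

# Text of a heading, selected by its full path
title <- html_nodes(page, xpath = "/html/body/h1") %>% html_text()

# Attribute of a link, using html_attr() instead of html_text()
link <- html_nodes(page, xpath = "//a") %>% html_attr("href")

title
link
```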


