Web Scraping using R Language

One of the most important skills in the field of Data Science is getting the right data for the problem you want to solve. Data Scientists don't always have a prepared database to work on; they often have to pull data from the right sources. For this purpose, APIs and web scraping are used.

  • API (Application Programming Interface): An API is a set of methods and tools that allows one to query and retrieve data dynamically. Reddit, Spotify, Twitter, Facebook, and many other companies provide free APIs that enable developers to access the information they store on their servers; others charge for access to their APIs.
  • Web Scraping: A lot of data isn't accessible through data sets or APIs but rather exists on the internet as web pages. Through web scraping, one can access that data without waiting for the provider to create an API.
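As a minimal sketch of the API side, an API typically returns JSON text that can be parsed directly into R objects. Assuming the jsonlite package is installed, and using a hypothetical response string in place of a live request:

```r
# Install once if needed: install.packages("jsonlite")
library(jsonlite)

# A hypothetical API response body, standing in for a live request
response <- '{"artist": "Example Band", "followers": 12345, "genres": ["rock", "indie"]}'

# Parse the JSON text into an R list
data <- fromJSON(response)
print(data$artist)     # "Example Band"
print(data$followers)  # 12345
```

A real call would fetch `response` over HTTP first, but the parsing step looks the same either way.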

What’s Web Scraping?

Web scraping is a technique for fetching data from websites. Many websites don't allow users to save the data they display for private use. One workaround is to manually copy and paste the data, which is both tedious and time-consuming. Web scraping is the automatic extraction of data from websites. It is done with the help of software known as web scrapers, which automatically load and extract data from websites based on user requirements. A scraper can be custom-built for one site or configured to work with any website.

Implementation of web scraping using R

There are several web scraping tools available for this task, and many languages have libraries that support web scraping. Among these, R is considered well suited for web scraping because of its rich package ecosystem, ease of use, and dynamic typing. The most commonly used web scraping package for R is rvest.

Prerequisites:

  • Install the package rvest in RStudio using the following code.
    install.packages('rvest')
  • Knowledge of HTML and CSS is an added advantage, but most Data Scientists are not deeply familiar with them. Therefore, let's use an open-source tool named SelectorGadget, which is more than sufficient for performing web scraping. One can access and download the SelectorGadget extension here. The rest of this article assumes the extension has been installed by following the instructions on its website, and that one is using Google Chrome, where the extension appears in the extension bar at the top right.
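To see how a CSS selector maps onto HTML before touching a live page, here is a small offline sketch; the HTML fragment is invented for illustration:

```r
# Install once if needed: install.packages("rvest")
library(rvest)

# An invented HTML fragment standing in for a real page
html <- '<html><body>
  <h1 class="entry-title">Sample Article</h1>
  <p>First paragraph.</p>
</body></html>'

page <- read_html(html)

# ".entry-title" selects elements whose class attribute is "entry-title" --
# exactly the kind of selector SelectorGadget reports
title <- html_text(html_node(page, ".entry-title"))
print(title)  # "Sample Article"
```

The same two calls, read_html() and html_node() with a selector, are all the later steps use.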

Steps involved in Web Scraping:



  • Step 1: Before you start coding, load the rvest library in RStudio.
    library(rvest)


  • Step 2: Read the HTML code from the webpage. Consider this webpage.
    webpage = read_html("https://www.geeksforgeeks.org/data-structures-in-r-programming")


  • Step 3: Now, let's start by scraping the heading field. For that, use SelectorGadget to get the specific CSS selector that encloses the heading: click on the extension in the browser and select the heading field with the cursor.

    Once one knows the CSS selector that contains the heading, he/she can use this simple R code to get the heading.

    # Using CSS selectors to scrape the heading section
    heading = html_node(webpage, '.entry-title')
      
    # Converting the heading data to text
    text = html_text(heading)
    print(text)


    Output:

    [1] "Data Structures in R Programming"
    
  • Step 4: Now, let's scrape all the paragraph fields, following the same procedure as before.

    Once one knows the CSS selector that contains the paragraphs, he/she can use this simple R code to get all the paragraphs.

    # Using CSS selectors to scrape 
    # all the paragraph section
    # Note that we use html_nodes() here
    paragraph = html_nodes(webpage, 'p')
      
    # Converting the paragraph data to text
    pText = html_text(paragraph)
      
    # Print the top 6 data
    print(head(pText))


    Output:

    [1] “A data structure is a particular way of organizing data in a computer so that it can be used effectively. The idea is to reduce the space and time complexities of different tasks. Data structures in R programming are tools for holding multiple values. ”
    [2] “R’s base data structures are often organized by their dimensionality (1D, 2D, or nD) and whether they’re homogeneous (all elements must be of the identical type) or heterogeneous (the elements are often of various types). This gives rise to the five data types which are most frequently utilized in data analysis. the subsequent table shows a transparent cut view of those data structures.”
    [3] “The most essential data structures used in R include:”
    [4] “”
    [5] “A vector is an ordered collection of basic data types of a given length. The only key thing here is all the elements of a vector must be of the identical data type e.g homogenous data structures. Vectors are one-dimensional data structures.”
    [6] “Example:”
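In the steps above, html_node() returned a single element while html_nodes() returned every match. That difference, along with html_attr() for pulling attributes such as link targets, can be sketched offline; the HTML fragment below is invented for illustration:

```r
library(rvest)

# Invented HTML standing in for a real page
html <- '<html><body>
  <p>Intro text.</p>
  <p>More text with a <a href="https://example.com">link</a>.</p>
</body></html>'

page <- read_html(html)

# html_node() returns only the first matching element
first_p <- html_text(html_node(page, "p"))

# html_nodes() returns every matching element
all_p <- html_text(html_nodes(page, "p"))

# html_attr() pulls an attribute value instead of the text
links <- html_attr(html_nodes(page, "a"), "href")

print(first_p)        # "Intro text."
print(length(all_p))  # 2
print(links)          # "https://example.com"
```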

The complete R code is given below.

# R program to illustrate
# Web Scraping
  
# Import rvest library
library(rvest)
  
# Reading the HTML code from the website
webpage = read_html("https://www.geeksforgeeks.org/data-structures-in-r-programming")
  
# Using CSS selectors to scrape the heading section
heading = html_node(webpage, '.entry-title')
  
# Converting the heading data to text
text = html_text(heading)
print(text)
  
# Using CSS selectors to scrape 
# all the paragraph section
# Note that we use html_nodes() here
paragraph = html_nodes(webpage, 'p')
  
# Converting the paragraph data to text
pText = html_text(paragraph)
  
# Print the top 6 data
print(head(pText))
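Once the paragraphs are extracted, a common follow-up is to collect them into a data frame and save them to disk. A sketch using hypothetical sample text in place of live results:

```r
# Hypothetical scraped paragraphs standing in for live results
pText <- c("A data structure is a particular way of organizing data.",
           "Vectors are one-dimensional data structures.")

# Collect into a data frame and write it out as CSV
results <- data.frame(paragraph = pText, stringsAsFactors = FALSE)
write.csv(results, "scraped_paragraphs.csv", row.names = FALSE)

print(nrow(results))  # 2
```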
