
Web Scraping Using RSelenium

RSelenium is a powerful R package for automating web browsers. It enables web scraping by interacting with a page through a real browser: clicking links, filling out forms, scrolling, and so on. RSelenium is particularly useful for pages that require user interaction, such as login screens, and for dynamic pages that load new content as the user scrolls.
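
For instance, the core interactions look like the following sketch. This is a minimal illustration rather than a runnable scraper: remDr is assumed to be an already-open remote driver, and the link text and CSS selector are hypothetical placeholders.

library(RSelenium)

# Click a link located by its (hypothetical) link text
link <- remDr$findElement(using = "link text", value = "Next page")
link$clickElement()

# Type into a (hypothetical) search box and press Enter
search_box <- remDr$findElement(using = "css", value = "input[name='q']")
search_box$sendKeysToElement(list("web scraping", key = "enter"))

# Scroll to the bottom of the page to trigger lazy-loaded content
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")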

RSelenium requires a running browser instance, which it controls through a web driver. The most common driver is Selenium WebDriver, which supports popular browsers such as Chrome, Firefox, and Safari. To use RSelenium, install the package and a compatible web driver; you can then automate the browser and interact with web pages from R.
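
A minimal start-to-stop session looks like this sketch, assuming RSelenium is installed from CRAN and rsDriver() can fetch a compatible driver through the wdman package:

# install.packages("RSelenium")   # one-time setup
library(RSelenium)

# Start a Selenium server and open a Firefox window;
# rsDriver() downloads and manages the driver binaries for you
rD <- rsDriver(browser = "firefox", chromever = NULL)
remDr <- rD$client

# Drive the browser
remDr$navigate("https://www.r-project.org/")
print(remDr$getTitle())

# Always close the browser and stop the server when finished
remDr$close()
rD$server$stop()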



Some of the key concepts involved in web scraping using RSelenium are:

- The Selenium server, which relays commands between R and the browser.
- The remote driver (client), the R object used to navigate to pages and issue commands.
- Element location, using XPath or CSS selectors to pinpoint the parts of a page to extract.
- Element methods such as getElementText() and getElementAttribute() for pulling out data.
- Waits, which give dynamic pages time to finish loading before elements are queried.

Overall, RSelenium is a versatile tool for web scraping that provides a range of functions for extracting data from web pages. It allows scraping tasks to be automated end to end and can handle complex pages with dynamic content.



Let’s say we want to scrape data from https://www.worldometers.info/coronavirus/, which publishes live statistics on the COVID-19 pandemic. The code below starts a Firefox session, opens the page, and reads three of its headline counters.




library(tidyverse)
library(RSelenium)
library(rvest)
library(httr)
 
# Start a Selenium server and open a Firefox browser
rD <- rsDriver(browser = "firefox",
               chromever = NULL)
remDr <- rD$client
 
# Navigate to the page and give the counters time to load
remDr$navigate("https://www.worldometers.info/coronavirus/")
Sys.sleep(5)
 
# Extract the total number of cases
total_cases <- remDr$findElement(using = "xpath",
                                 value = '//*[@id="maincounter-wrap"]/div/span')
total_cases <- total_cases$getElementText()[[1]]
 
# Extract the total number of deaths
total_deaths <- remDr$findElement(using = "xpath",
                                  value = '/html/body/div[3]/div[2]/div[1]/div/div[6]/div/span')
total_deaths <- total_deaths$getElementText()[[1]]
 
# Extract the total number of recoveries
total_recoveries <- remDr$findElement(using = "xpath",
                                      value = '/html/body/div[3]/div[2]/div[1]/div/div[7]/div/span')
total_recoveries <- total_recoveries$getElementText()[[1]]
 
# Print the extracted data
cat("Total Cases: ", total_cases, "\n")
cat("Total Deaths: ", total_deaths, "\n")
cat("Total Recoveries: ", total_recoveries, "\n")
 
# Close the browser and stop the server
remDr$close()
rD$server$stop()

Output:

Total Cases:  685,740,983
Total Deaths:  6,842,948
Total Recoveries:  658,490,977
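
The counters come back as comma-formatted strings. If you want them as numbers for further analysis, a small follow-up step (using only base R) is to strip the separators before converting:

# Convert the scraped, comma-formatted strings to numeric values
to_number <- function(x) as.numeric(gsub(",", "", x))

to_number(total_cases)    # 685740983
to_number(total_deaths)   # 6842948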

Now let’s try to fetch the top 5 articles from the BBC News website. This code starts a Selenium server, opens a Chrome browser window, navigates to the BBC News website, waits for the page to load, finds the top 5 articles on the page, extracts the titles and URLs of those articles, and prints them to the console.




# Load libraries
library(RSelenium)
library(rvest)
 
# Start a Selenium server (selenium() comes from the wdman package);
# its port must match the port the remote driver connects to
selServ <- wdman::selenium(port = 4445L,
                           jvmargs = c("-Dwebdriver.chrome.driver=/usr/bin/chromedriver"))
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
 
# Navigate to BBC News website
remDr$navigate("https://www.bbc.com/news")
 
# Wait for the page to load
Sys.sleep(5)
 
# Find the top 5 articles
article_links <- remDr$findElements(using = "css", "#top-story a")
 
# Extract the titles and URLs of the top 5 articles
# ([[1]] unwraps the one-element list each method returns)
article_titles <- sapply(article_links[1:5], function(x) x$getElementText()[[1]])
article_urls <- sapply(article_links[1:5], function(x) x$getElementAttribute("href")[[1]])
 
# Print the titles and URLs
for (i in 1:5) {
  cat(paste0(i, ". ", article_titles[i], "\n"))
  cat(article_urls[i], "\n\n")
}
 
# Close the browser and stop the server
remDr$close()
selServ$stop()

Output:

(screenshot: a numbered list of the top five article titles, each followed by its URL)

This output shows the titles and URLs of the top 5 articles on the BBC News website at the time the code was run. The titles are numbered, and each is followed by the corresponding URL.
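
One caveat: the fixed Sys.sleep(5) used above is simple but brittle on slow pages. A more robust pattern is to poll until the element you need appears; the sketch below is illustrative, with an assumed timeout and the same #top-story selector:

# Poll for an element instead of sleeping for a fixed time;
# findElement() throws an error while the element is absent
wait_for_element <- function(remDr, css, timeout = 15) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    el <- tryCatch(remDr$findElement(using = "css", value = css),
                   error = function(e) NULL)
    if (!is.null(el)) return(el)
    Sys.sleep(0.5)
  }
  stop("Timed out waiting for element: ", css)
}

# Usage: block until the top stories are present, then scrape as before
# first_link <- wait_for_element(remDr, "#top-story a")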

