
Web Scraping Using RSelenium

RSelenium is a powerful R package for automating web browsers. It enables web scraping by interacting with a page through a real browser: clicking links, filling out forms, scrolling, and so on. RSelenium is particularly useful for pages that require user interaction, such as login screens, and for dynamic pages that load new content as the user scrolls.
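
For instance, the core interactions look like the following sketch. This is a minimal illustration rather than a runnable scraper: remDr is assumed to be an already-open remote driver, and the link text and CSS selector are hypothetical placeholders.

library(RSelenium)

# Click a link located by its (hypothetical) link text
link <- remDr$findElement(using = "link text", value = "Next page")
link$clickElement()

# Type into a (hypothetical) search box and press Enter
search_box <- remDr$findElement(using = "css", value = "input[name='q']")
search_box$sendKeysToElement(list("web scraping", key = "enter"))

# Scroll to the bottom of the page to trigger lazy-loaded content
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")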

RSelenium requires a running browser instance, which it controls through a web driver. The most common driver is Selenium WebDriver, which supports popular browsers such as Chrome, Firefox, and Safari. To use RSelenium, install the package and a compatible web driver; you can then automate the browser and interact with web pages from R.
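
A minimal start-to-stop session looks like this sketch, assuming RSelenium is installed from CRAN and rsDriver() can fetch a compatible driver through the wdman package:

# install.packages("RSelenium")   # one-time setup
library(RSelenium)

# Start a Selenium server and open a Firefox window;
# rsDriver() downloads and manages the driver binaries for you
rD <- rsDriver(browser = "firefox", chromever = NULL)
remDr <- rD$client

# Drive the browser
remDr$navigate("https://www.r-project.org/")
print(remDr$getTitle())

# Always close the browser and stop the server when finished
remDr$close()
rD$server$stop()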



Some of the key concepts involved in web scraping using RSelenium are:

- The Selenium server, which relays commands between R and the browser.
- The remote driver (client), the R object used to navigate to pages and issue commands.
- Element location, using XPath or CSS selectors to pinpoint the parts of a page to extract.
- Element methods such as getElementText() and getElementAttribute() for pulling out data.
- Waits, which give dynamic pages time to finish loading before elements are queried.

Overall, RSelenium is a versatile tool for web scraping that provides a range of functions for extracting data from web pages. It allows scraping tasks to be automated end to end and can handle complex pages with dynamic content.



Let’s say we want to scrape data from https://www.worldometers.info/coronavirus/, which publishes live statistics on the COVID-19 pandemic. The code below starts a Firefox session, opens the page, and reads three of its headline counters.




library(tidyverse)
library(RSelenium)
library(rvest)
library(httr)
 
# Start a Selenium server and open a Firefox browser
rD <- rsDriver(browser = "firefox",
               chromever = NULL)
remDr <- rD$client
 
# Navigate to the page and give the counters time to load
remDr$navigate("https://www.worldometers.info/coronavirus/")
Sys.sleep(5)
 
# Extract the total number of cases
total_cases <- remDr$findElement(using = "xpath",
                                 value = '//*[@id="maincounter-wrap"]/div/span')
total_cases <- total_cases$getElementText()[[1]]
 
# Extract the total number of deaths
total_deaths <- remDr$findElement(using = "xpath",
                                  value = '/html/body/div[3]/div[2]/div[1]/div/div[6]/div/span')
total_deaths <- total_deaths$getElementText()[[1]]
 
# Extract the total number of recoveries
total_recoveries <- remDr$findElement(using = "xpath",
                                      value = '/html/body/div[3]/div[2]/div[1]/div/div[7]/div/span')
total_recoveries <- total_recoveries$getElementText()[[1]]
 
# Print the extracted data
cat("Total Cases: ", total_cases, "\n")
cat("Total Deaths: ", total_deaths, "\n")
cat("Total Recoveries: ", total_recoveries, "\n")
 
# Close the browser and stop the server
remDr$close()
rD$server$stop()

Output:

Total Cases:  685,740,983
Total Deaths:  6,842,948
Total Recoveries:  658,490,977
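
The counters come back as comma-formatted strings. If you want them as numbers for further analysis, a small follow-up step (using only base R) is to strip the separators before converting:

# Convert the scraped, comma-formatted strings to numeric values
to_number <- function(x) as.numeric(gsub(",", "", x))

to_number(total_cases)    # 685740983
to_number(total_deaths)   # 6842948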

Now let’s try to fetch the top 5 articles from the BBC News website. This code starts a Selenium server, opens a Chrome browser window, navigates to the BBC News website, waits for the page to load, finds the top 5 articles on the page, extracts the titles and URLs of those articles, and prints them to the console.




# Load libraries
library(RSelenium)
library(rvest)
 
# Start a Selenium server (selenium() comes from the wdman package);
# its port must match the port the remote driver connects to
selServ <- wdman::selenium(port = 4445L,
                           jvmargs = c("-Dwebdriver.chrome.driver=/usr/bin/chromedriver"))
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
 
# Navigate to BBC News website
remDr$navigate("https://www.bbc.com/news")
 
# Wait for the page to load
Sys.sleep(5)
 
# Find the top 5 articles
article_links <- remDr$findElements(using = "css", "#top-story a")
 
# Extract the titles and URLs of the top 5 articles
# ([[1]] unwraps the one-element list each method returns)
article_titles <- sapply(article_links[1:5], function(x) x$getElementText()[[1]])
article_urls <- sapply(article_links[1:5], function(x) x$getElementAttribute("href")[[1]])
 
# Print the titles and URLs
for (i in 1:5) {
  cat(paste0(i, ". ", article_titles[i], "\n"))
  cat(article_urls[i], "\n\n")
}
 
# Close the browser and stop the server
remDr$close()
selServ$stop()

Output:

(screenshot: a numbered list of the top five article titles, each followed by its URL)

This output shows the titles and URLs of the top 5 articles on the BBC News website at the time the code was run. The titles are numbered, and each is followed by the corresponding URL.
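
One caveat: the fixed Sys.sleep(5) used above is simple but brittle on slow pages. A more robust pattern is to poll until the element you need appears; the sketch below is illustrative, with an assumed timeout and the same #top-story selector:

# Poll for an element instead of sleeping for a fixed time;
# findElement() throws an error while the element is absent
wait_for_element <- function(remDr, css, timeout = 15) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    el <- tryCatch(remDr$findElement(using = "css", value = css),
                   error = function(e) NULL)
    if (!is.null(el)) return(el)
    Sys.sleep(0.5)
  }
  stop("Timed out waiting for element: ", css)
}

# Usage: block until the top stories are present, then scrape as before
# first_link <- wait_for_element(remDr, "#top-story a")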

