
Scraping a Table on https site using R

Last Updated : 21 Mar, 2023

In this article, we cover the basics of web scraping using the R language and the rvest and tidyverse libraries. We show how to extract tables from a website and how to manipulate the resulting data. The examples should provide a good starting point for anyone looking to scrape tables from websites. The following are the key concepts related to scraping tables in R:

  1. Web scraping with R: R provides various libraries such as rvest and XML that can be used to extract data from websites.
  2. Reading HTML: R can read HTML pages, and these pages can be parsed to extract the data we are interested in.
  3. Selectors: To extract data from a website, we need to know the HTML structure of the page. Selectors in R allow us to select elements from the HTML page using CSS selectors or XPath.
  4. Parsing HTML: After selecting the elements of interest, the next step is to parse the HTML content and extract the data.
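As a minimal sketch of the selector concepts above (the HTML snippet and the class name `intro` are made up for illustration), the same element can be selected with either a CSS selector or an XPath expression:

```r
library(rvest)

# A small, made-up HTML document for illustration
html <- '<html><body>
  <p class="intro">Welcome</p>
  <table><tr><td>42</td></tr></table>
</body></html>'

doc <- read_html(html)

# Select the paragraph with a CSS selector
css_result <- doc %>% html_element("p.intro") %>% html_text2()

# Select the same paragraph with an equivalent XPath expression
xpath_result <- doc %>%
  html_element(xpath = "//p[@class='intro']") %>%
  html_text2()

css_result    # "Welcome"
xpath_result  # "Welcome"
```

Both calls return the same text; CSS selectors are usually shorter, while XPath can express conditions (such as matching on text content) that CSS cannot.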

Before we start scraping tables, the following prerequisites must be met: R should be installed on the system, and the rvest and tidyverse libraries must be installed in R (tidyverse is used in the second example below). If they are not installed, they can be installed by running the following command in the R console:

install.packages(c("rvest", "tidyverse"))

Scraping a Table from a Static Website

In this example, we use the read_html function to read the HTML content of the website. Then we use the html_nodes function to select all tables on the page with a CSS selector. Finally, we extract the table contents using the html_table function, keep the second table on the page, and print its first six rows.

R
library(rvest)

# Read the HTML content of the website
webpage <- read_html(
  "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita")

# Select all tables using a CSS selector
table_node <- html_nodes(webpage, "table")

# Extract the content of the second table on the page
table_content <- html_table(table_node)[[2]]

# Print the first six rows of the table
head(table_content)


Output:


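The [[2]] indexing above works because html_table, when applied to a set of nodes, returns a list of data frames, one per table. A small offline sketch (with two made-up tables) shows this behaviour:

```r
library(rvest)

# Two made-up tables in one document, for illustration
html <- '<html><body>
  <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>b</th></tr><tr><td>2</td></tr></table>
</body></html>'

# html_table on a node set returns a list of data frames
tables <- read_html(html) %>% html_nodes("table") %>% html_table()

length(tables)  # 2 -- one data frame per table
tables[[2]]     # the second table, with column "b"
```

When a page contains many tables, printing length(tables) and inspecting each element is a quick way to find the index of the one you want.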
Scraping a Table from a Dynamic Website

In this example, we scrape a table from a dynamic website. The rvest library is used to read the HTML code of the webpage and extract the table: the html_nodes function selects the first table on the page, the html_table function converts its HTML into a data frame, and the head function displays the first few rows. Note that rvest only sees the HTML as the server delivers it; if a table is injected purely by JavaScript after the page loads, it will not appear in the result of read_html, and a headless-browser tool would be needed instead.

R
library(rvest)
library(tidyverse)

# URL of the website
url <- "https://www.worldometers.info/world-population/population-by-country/"

# Read the HTML code of the page
html_code <- read_html(url)

# Use the html_nodes function to select the first table
table_html <- html_code %>% html_nodes("table") %>% .[[1]]

# Use the html_table function to convert the table's
# HTML code into a data frame
table_df <- table_html %>% html_table()

# Inspect the first few rows of the data frame
head(table_df)


Output:

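Scraped tables often arrive with numbers stored as character strings containing thousands separators. As a hedged sketch of the kind of cleanup the tidyverse makes easy (the column names and values here are made up, not taken from the scraped page), readr's parse_number can convert such columns to numeric:

```r
library(tibble)
library(dplyr)
library(readr)

# A made-up stand-in for a scraped table: numbers often arrive
# as character strings with thousands separators
scraped <- tibble(
  Country    = c("China", "India"),
  Population = c("1,412,600,000", "1,408,000,000")
)

# parse_number() strips the separators and converts to numeric
clean <- scraped %>%
  mutate(Population = parse_number(Population))

clean$Population  # numeric: 1412600000 1408000000
```

After this step the column supports arithmetic and sorting, e.g. arrange(clean, desc(Population)).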