
Scraping a Table on https site using R

Last Updated : 21 Mar, 2023

In this article, we cover the basics of web scraping using the R language and the rvest and tidyverse libraries. We show how to extract tables from a website and how to manipulate the resulting data. The examples should provide a good starting point for anyone looking to scrape tables from websites. The following are the key concepts related to scraping tables in R:

  1. Web scraping with R: R provides various libraries such as rvest and XML that can be used to extract data from websites.
  2. Reading HTML: R can read HTML pages, and these pages can be parsed to extract the data we are interested in.
  3. Selectors: To extract data from a website, we need to know the HTML structure of the page. Selectors in R allow us to select elements from the HTML page using CSS selectors or XPath.
  4. Parsing HTML: After selecting the elements of interest, the next step is to parse the HTML content and extract the data.
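As a minimal sketch of the selector concepts above (the HTML snippet and the class name `intro` are made up for illustration), the same element can be selected with either a CSS selector or an XPath expression:

```r
library(rvest)

# A small, made-up HTML document for illustration
html <- '<html><body>
  <p class="intro">Welcome</p>
  <table><tr><td>42</td></tr></table>
</body></html>'

doc <- read_html(html)

# Select the paragraph with a CSS selector
css_result <- doc %>% html_element("p.intro") %>% html_text2()

# Select the same paragraph with an equivalent XPath expression
xpath_result <- doc %>%
  html_element(xpath = "//p[@class='intro']") %>%
  html_text2()

css_result    # "Welcome"
xpath_result  # "Welcome"
```

Both calls return the same text; CSS selectors are usually shorter, while XPath can express conditions (such as matching on text content) that CSS cannot.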

Before we start scraping tables, the following prerequisites must be met: R should be installed on the system, and the rvest and tidyverse libraries must be installed in R (tidyverse is used in the second example below). If they are not installed, they can be installed by running the following command in the R console:

install.packages(c("rvest", "tidyverse"))

Scraping a Table from a Static Website

In this example, we use the read_html function to read the HTML content of the website. Then we use the html_nodes function to select all tables on the page with a CSS selector. Finally, we extract the table contents using the html_table function, keep the second table on the page, and print its first six rows.

R
library(rvest)

# Read the HTML content of the website
webpage <- read_html(
  "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita")

# Select all tables using a CSS selector
table_node <- html_nodes(webpage, "table")

# Extract the content of the second table on the page
table_content <- html_table(table_node)[[2]]

# Print the first six rows of the table
head(table_content)


Output:


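The [[2]] indexing above works because html_table, when applied to a set of nodes, returns a list of data frames, one per table. A small offline sketch (with two made-up tables) shows this behaviour:

```r
library(rvest)

# Two made-up tables in one document, for illustration
html <- '<html><body>
  <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>b</th></tr><tr><td>2</td></tr></table>
</body></html>'

# html_table on a node set returns a list of data frames
tables <- read_html(html) %>% html_nodes("table") %>% html_table()

length(tables)  # 2 -- one data frame per table
tables[[2]]     # the second table, with column "b"
```

When a page contains many tables, printing length(tables) and inspecting each element is a quick way to find the index of the one you want.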
Scraping a Table from a Dynamic Website

In this example, we scrape a table from a dynamic website. The rvest library is used to read the HTML code of the webpage and extract the table: the html_nodes function selects the first table on the page, the html_table function converts its HTML into a data frame, and the head function displays the first few rows. Note that rvest only sees the HTML as the server delivers it; if a table is injected purely by JavaScript after the page loads, it will not appear in the result of read_html, and a headless-browser tool would be needed instead.

R
library(rvest)
library(tidyverse)

# URL of the website
url <- "https://www.worldometers.info/world-population/population-by-country/"

# Read the HTML code of the page
html_code <- read_html(url)

# Use the html_nodes function to select the first table
table_html <- html_code %>% html_nodes("table") %>% .[[1]]

# Use the html_table function to convert the table's
# HTML code into a data frame
table_df <- table_html %>% html_table()

# Inspect the first few rows of the data frame
head(table_df)


Output:

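Scraped tables often arrive with numbers stored as character strings containing thousands separators. As a hedged sketch of the kind of cleanup the tidyverse makes easy (the column names and values here are made up, not taken from the scraped page), readr's parse_number can convert such columns to numeric:

```r
library(tibble)
library(dplyr)
library(readr)

# A made-up stand-in for a scraped table: numbers often arrive
# as character strings with thousands separators
scraped <- tibble(
  Country    = c("China", "India"),
  Population = c("1,412,600,000", "1,408,000,000")
)

# parse_number() strips the separators and converts to numeric
clean <- scraped %>%
  mutate(Population = parse_number(Population))

clean$Population  # numeric: 1412600000 1408000000
```

After this step the column supports arithmetic and sorting, e.g. arrange(clean, desc(Population)).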