
How to Read multiple files parallelly and extract data in R

Last Updated : 17 Apr, 2023

In this article, we are going to learn how to read multiple files parallelly and extract data in R.

In R, reading files and extracting data from them can be done using various functions such as ‘read.table’, ‘read.csv’, and others. However, when working with a large number of files, reading them one by one can be time-consuming. To improve performance, it’s common to use parallel computing to read multiple files at the same time. Parallel computing is a method of running multiple computations simultaneously using multiple processors or cores. R provides several packages, such as ‘parallel’ and ‘purrr’, that allow you to take advantage of parallel computing.

Parallel library

The 'parallel' package provides several functions for parallel computation, such as 'mclapply', 'mcMap', and 'parLapply'. These behave like their sequential counterparts 'lapply', 'Map', and 'lapply', respectively, but distribute the work across multiple processes or cores.
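
As a small illustration (a toy sketch, not from the article), here is a minimal 'mclapply' call. Note that 'mclapply' relies on process forking, which is unavailable on Windows, so the sketch falls back to a single core there:

```r
library(parallel)

# Fork-based parallelism is not available on Windows,
# so fall back to one core on that platform
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Apply a function to each element of 1:4, possibly in parallel
squares <- mclapply(1:4, function(x) x^2, mc.cores = n_cores)
print(unlist(squares))
```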

Purrr library

The 'purrr' package provides a set of functional programming tools, including the 'map_df' function, which applies a function to each element of a list and row-binds the resulting data frames into one. This makes it convenient for reading, extracting, and combining data from multiple files. Note that 'map_df' itself runs sequentially; for a truly parallel equivalent, the 'furrr' package offers functions such as 'future_map_dfr'.
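
For instance (a toy sketch, not from the article; 'make_row' is a hypothetical helper), 'map_df' can turn a vector into a single row-bound data frame:

```r
library(purrr)

# Hypothetical helper: build a one-row data frame per input value
make_row <- function(x) data.frame(id = x, squared = x^2)

# map_df applies make_row to each element and row-binds the results
combined <- map_df(1:3, make_row)
print(combined)
```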

It's important to note that parallel computing can improve the performance of code, but it also requires additional resources such as memory and can increase the complexity of the code. It's therefore worth measuring the performance of the code before and after parallelizing it and deciding whether the benefits outweigh the costs.
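
One way to measure this, sketched below with a toy workload (the sleep duration and worker count are illustrative assumptions), is to compare 'system.time' for a serial and a parallel run of the same task:

```r
library(parallel)

# A deliberately slow task so any parallel speed-up is visible
slow_task <- function(x) {
  Sys.sleep(0.2)
  x * 2
}

# Serial run: four tasks, roughly 0.8 s of sleeping in total
serial_elapsed <- system.time(lapply(1:4, slow_task))["elapsed"]

# Parallel run with two workers
cl <- makeCluster(2)
parallel_elapsed <- system.time(parLapply(cl, 1:4, slow_task))["elapsed"]
stopCluster(cl)

cat("serial:", serial_elapsed, "s, parallel:", parallel_elapsed, "s\n")
```

Remember that 'makeCluster' itself has a startup cost, which this timing deliberately excludes; for very short workloads that overhead can swamp any gain.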

Example:

Here is an example of how we can use the parallel package to read multiple files and extract data from them. We have created three text files on which we apply the operation, as follows.

sample1.txt

ID,Name,Age,Gender
1,John Smith,25,Male
2,Jane Doe,30,Female
3,Bob Johnson,35,Male

sample2.txt

ID,Name,Age,Gender
4,Emily Davis,22,Female
5,Michael Brown,28,Male
6,Jessica Wilson,32,Female

sample3.txt

ID,Name,Age,Gender
7,Jacob Miller,24,Male
8,Amanda Taylor,29,Female
9,Matthew Anderson,33,Male
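
If you want to follow along, the three sample files can be recreated with 'writeLines' (the contents match the listings above; the working directory is assumed to be writable):

```r
# Recreate the three sample CSV-formatted text files used below
samples <- list(
  "sample1.txt" = c("ID,Name,Age,Gender",
                    "1,John Smith,25,Male",
                    "2,Jane Doe,30,Female",
                    "3,Bob Johnson,35,Male"),
  "sample2.txt" = c("ID,Name,Age,Gender",
                    "4,Emily Davis,22,Female",
                    "5,Michael Brown,28,Male",
                    "6,Jessica Wilson,32,Female"),
  "sample3.txt" = c("ID,Name,Age,Gender",
                    "7,Jacob Miller,24,Male",
                    "8,Amanda Taylor,29,Female",
                    "9,Matthew Anderson,33,Male")
)

# Write each character vector to its file, one line per element
for (name in names(samples)) writeLines(samples[[name]], name)
```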

Read multiple files parallelly using parallel package in R

In this method, we are going to read three text files parallelly which we created earlier. Below are the steps to implement this.

Step 1: Import the required library.

Step 2: Create a list of file names containing "sample1.txt", "sample2.txt", and "sample3.txt".

Step 3: Store the path of the text files in the variable “path”.

Step 4: An R function named ‘extract_data_from_file’ is defined, which takes a file name as input and reads the contents of that file using the ‘readLines’ function.

Step 5: A cluster of cores is created using the makeCluster() function, with the number of cores equal to the number of cores detected on the machine. The parLapply() function is then used to apply the extract_data_from_file() function in parallel to each file in the ‘file_list’.

Step 6: After the parallel processing is done, the cluster is stopped using the stopCluster() function.

Step 7: The results of the parallel processing are stored in the ‘results’ variable and are printed to the console using the print() function.

R




# Import required library
library(parallel)
  
# Create a list of file names
file_list <- c("sample1.txt",
               "sample2.txt",
               "sample3.txt")
path <- "C:/Users/sande/OneDrive/Desktop/files/"
file_list <- paste0(path, file_list)
  
extract_data_from_file <- function(file) {
  # code to extract data from file
  data <- readLines(file)
  # perform data extraction operations here
  # e.g. strsplit(data, " ")
  return(data)
}
  
# Create a cluster with one worker per detected core
cl <- makeCluster(detectCores())
results <- parLapply(cl, file_list,
                     extract_data_from_file)
stopCluster(cl)
  
# print the results
print(results)


Output:

[Output screenshot: a list of three character vectors, one per file, each holding that file's lines]

Read multiple files parallelly using purrr package in R

In this code, the map_df() function from the "purrr" package is used to extract data from each file in 'file_list'. The extract_data_from_file() function is applied to each file, and the results are combined into a single data frame by map_df(). The data is then filtered using the 'dplyr' package to keep only the rows where the Gender is "Female", and the filtered data is printed. Note that 'map_df' is not actually parallelized: it processes the files sequentially, though it still avoids an explicit loop. For a genuinely parallel version, the 'furrr' package provides near drop-in equivalents such as 'future_map_dfr'.

R




# Import required library
library(purrr)
library(dplyr)
  
# Create a list of file names
file_list <- c("sample1.txt",
               "sample2.txt", "sample3.txt")
path <- "C:/Users/sande/OneDrive/Desktop/files/"
file_list <- paste0(path, file_list)
  
# Function to extract data from a single file
extract_data_from_file <- function(file) {
  data <- read.csv(file, header = TRUE,
                   stringsAsFactors = FALSE)
  return(data)
}
  
# Use map_df to extract data from
# multiple files in parallel
extracted_data_list <- map_df(file_list,
                              extract_data_from_file)
  
# Extract data using condition
filtered_data <- extracted_data_list %>%
  filter(Gender == "Female")
  
# Print the results
print(filtered_data)


Output:

[Output screenshot: a data frame containing only the rows where Gender is "Female"]


