Open In App

How to merge data in R using R merge, dplyr, or data.table

Last Updated : 24 Aug, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

R is a programming language and a software environment that is primarily used for statistic computing and graphics. It provides a wide range of tools for data manipulation, data analysis, data visualization, and statistical modeling.

Merging data is a task in data analysis and data manipulation. In R there are various ways to merge data frames, using the ‘merge()’ function from base R, using the ‘dplyr’ package, and the ‘data.table’ package. In this tutorial, we will use the above three ways to merge data using R.

1. Using ‘merge()’ from base R:

The merge() function in base R helps us to combine two or more data frames based on common columns. It performs various types of joins such as inner join, left join, right join, and full join.

Syntax:

merged_df <- merge(x,y,by = "common_column",..)
  • ‘x’ and ‘y’ are the data frames that you want to merge.
  • ‘by’ specifies the common columns on which the merge will be performed.
  • Additional arguments like ‘all.x’,all.y’ and ‘all’ control the type of join that is to be performed.

Example:

Consider two data frames ‘df1’ and ‘df2’

R




df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
 
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))


Let’s see examples of performing various types of joins using the ‘merge()’ function:

1. Inner join (default behavior):

R




inner_join <- merge(df1, df2, by = "ID")
print(inner_join)


Output:

  ID Name Age Occupation Salary
1  2    B  30   Engineer   5000
2  3    C  35    Teacher   4000
3  4    D  40     Doctor   6000

The resulting ‘inner_join’ dataframe will only include the common rows where ‘ID’ is present in both ‘df1’ and ‘df2’.

2. Left join(‘all.x=TRUE’):

R




left_join <- merge(df1, df2, by = "ID", all.x = TRUE)
print (left_join)


Output:

   ID Name Age Occupation Salary
1  1    A  25       <NA>     NA
2  2    B  30   Engineer   5000
3  3    C  35    Teacher   4000
4  4    D  40     Doctor   6000

The resulting ‘left_join’ data frame will include all rows from ‘df1’ and the matching rows from ‘df2’. Non-matching rows from ‘df2’ will have an ‘NA’ value

3. Right join (‘all.y=TRUE’):

R




right_join <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_join)


Output:

  ID Name Age Occupation Salary
1  2    B  30   Engineer   5000
2  3    C  35    Teacher   4000
3  4    D  40     Doctor   6000
4  5 <NA>  NA     Lawyer   7000

The resulting ‘right_join’ data frame will include all rows from ‘df2’ and the matching rows from ‘df1’. Non-matching rows from ‘df1’ will have ‘NA’ values.

4. Full outer join(‘all =TRUE’)

R




full_join <- merge(df1, df2, by = "ID", all = TRUE)
print(full_join)


Output:

ID Name Age Occupation Salary
1  1    A  25       <NA>     NA
2  2    B  30   Engineer   5000
3  3    C  35    Teacher   4000
4  4    D  40     Doctor   6000
5  5 <NA>  NA     Lawyer   7000

The resulting ‘full_join’ data frame will include all rows from both ‘df1’ and ‘df2’. Non-matching values will have ‘NA’ values.

2. Using ‘dplyr’ package:

The ‘dplyr’ package provides a set of functions for data manipulation, including merging data frames.

The primary function for merging in ‘dplyr’ is ‘join()’, which supports various types of joins.

Syntax:

merged_df<- join(x,y,by="common_column",type="type_of_join")
  • ‘x’ and ‘y’ are the data frames to be merged.
  • ‘by’ specifies the common columns on which the merge is to be performed
  • ‘type_of_join’ can be ‘inner’, ‘left’,’ right’ or ‘full’ to specify the type of join.

Example:
Install the dplyr() package and create two data frames, ‘df1’ and ‘df2’.

R




library(dplyr)
 
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(20, 30, 40, 50))
 
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(2000, 4000, 6000, 7000))


Let’s see examples of performing various types of joins using the ‘dplyr’ functions:

1. Inner join:

R




inner_join <- inner_join(df1, df2, by = "ID")
print(inner_join)


Output:

 ID Name Age Occupation Salary
1  2    B  30   Engineer   2000
2  3    C  40    Teacher   4000
3  4    D  50     Doctor   6000

The resulting ‘inner_join’ data frame will only include the common rows where ‘ID’ is present in both ‘df1’ and ‘df2’.

2. Left join:

R




left_join <- left_join(df1, df2, by = "ID")
print(left_join)


Output:

ID Name Age Occupation Salary
1  1    A  20       <NA>     NA
2  2    B  30   Engineer   2000
3  3    C  40    Teacher   4000
4  4    D  50     Doctor   6000

The resulting ‘left_join’ data frame will include all rows from ‘df1’ and the matching rows from ‘df2’. Non-matching rows from ‘df2’ will have ‘NA’ values.

3. Right join:

R




right_join <- right_join(df1, df2, by = "ID")
print(right_join)


Output:

ID Name Age Occupation Salary
1  2    B  30   Engineer   2000
2  3    C  40    Teacher   4000
3  4    D  50     Doctor   6000
4  5 <NA>  NA     Lawyer   7000

The resulting ‘right_join’ dataframe will include all rows from ‘df2’ and the matching rows from ‘df1’. Non-matching rows of ‘df1’ will have ‘NA’ values.

4. Full outer join:

R




full_join <- full_join(df1, df2, by = "ID")
print(full_join)


Output:

 ID Name Age Occupation Salary
1  1    A  20       <NA>     NA
2  2    B  30   Engineer   2000
3  3    C  40    Teacher   4000
4  4    D  50     Doctor   6000
5  5 <NA>  NA     Lawyer   7000

The resulting ‘full_join’ data frame will include all rows from both ‘df1’ and ‘df2’. Non- matching rows will have ‘NA’ values.

3. Using ‘data.table’ package:

The ‘data.table’ package offers an efficient and fast approach to data manipulation. It provides the ‘merge()’ function. It is similar to the one in base R but optimized for speed.

Syntax:

merged_dt <- merge(x, y, by = "common_column", ...)
  • ‘x’ and ‘y’ are the data frames that are to be merged.
  • ‘by’ specifies the common columns on which the merge will be performed.
  • Additional arguments like ‘all.x’, ‘all.y’ and ‘all’ that controls the type of join.

Example:

Install the data.table library and create two data tables, ‘dt1’ and ‘dt2’.

R




library(data.table)
dt1 <- data.table(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
dt2 <- data.table(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))


Let’s see examples of performing various types of merges using the ‘merge()’ function from ‘data.table’ package:

1. Inner join( default behaviour):

R




inner_join <- merge(df1, df2, by = "ID")
print(inner_join)


Output:

 ID Name Age Occupation Salary
1  2    B  30   Engineer   2000
2  3    C  40    Teacher   4000
3  4    D  50     Doctor   6000

The resulting ‘inner_join’ data frame will only include the common rows where ‘ID’ is present in both ‘df1’ and ‘df2’.

2. Left join( ‘all.x = TRUE’):

R




left_join <- merge(df1, df2, by = "ID", all.x = TRUE)
print(left_join)


Output:

 ID Name Age Occupation Salary
1  1    A  20       <NA>     NA
2  2    B  30   Engineer   2000
3  3    C  40    Teacher   4000
4  4    D  50     Doctor   6000

The resulting ‘left_join’ data frame will include all Non-matching from ‘df1’ and the matching rows from ‘df2’. Non-matching rows from ‘df2’ will have ‘NA’ values.

3. Right join( ‘all.y = TRUE’):

R




right_join <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_join)


Output:

  ID Name Age Occupation Salary
1  2    B  30   Engineer   2000
2  3    C  40    Teacher   4000
3  4    D  50     Doctor   6000
4  5 <NA>  NA     Lawyer   7000

The resulting ‘right_join’ data frame will include all Non-matching rows from ‘df2’ and the matching rows from ‘df1’. Non- matching rows from ‘df1’ will have ‘NA’ values.

3. Full outer join( ‘all = TRUE’):

R




full_join <- merge(df1, df2, by = "ID", all = TRUE)
print(full_join)


Output:

 ID Name Age Occupation Salary
1  1    A  20       <NA>     NA
2  2    B  30   Engineer   2000
3  3    C  40    Teacher   4000
4  4    D  50     Doctor   6000
5  5 <NA>  NA     Lawyer   7000

The resulting ‘full_join’ data frame will include all Non-matching rows from both ‘df1’ and ‘df2’. Non- matching rows will have ‘NA’ values.

One of the advantages of using ‘dplyr’ and ‘data.table’ is that they provide a more concise and readable syntax for data manipulation compared to base R.

Summary:

  • ‘merge()’ provides a general-purpose function for merging data frames.
  • ‘dplyr’ focuses on merging with a focus on readability and ease of use.
  • ‘data. table’ provides a fast and efficient way to handle large datasets with optimized performance.
  • We should choose the approach based on the complexity of the task, and performing requirements.


Similar Reads

Single-Table Analysis with dplyr using R Language
The dplyr package is used to perform simulations in the data by performing manipulations and transformations. It can be installed into the working space using the following command : install.packages("dplyr") Let's create the main dataframe: C/C++ Code #installing the required libraries library(dplyr) #creating a data frame data_frame = data.frame(
5 min read
How to Create Frequency Table by Group using Dplyr in R
In this article, we will be looking at the approach to creating a frequency table group with its working examples in the R programming language. Create Frequency Table by Group using dplyr package: In this approach to create the frequency table by group, the user first needs to import and install the dplyr package in the working console, and then t
2 min read
How to Use dplyr to Generate a Frequency Table in R
The frequency table in R is used to create a table with a respective count for both the discrete values and the grouped intervals. It indicates the counts of each segment of the table. It is helpful for constructing the probabilities and drawing an idea about the data distribution. The dplyr package is used to perform simulations in the data by per
3 min read
Filter data by multiple conditions in R using Dplyr
In this article, we will learn how can we filter dataframe by multiple conditions in R programming language using dplyr package. The filter() function is used to produce a subset of the data frame, retaining all rows that satisfy the specified conditions. The filter() method in R programming language can be applied to both grouped and ungrouped dat
3 min read
How to Replace Multiple Values in Data Frame Using dplyr
Replacing multiple values in a data frame involves substituting specific values in one or more columns with new values. This process is often necessary to standardize or clean the data before analysis. In R, the dplyr package offers efficient functions for data manipulation, including mutate() for creating new variables with modified values and cas
2 min read
Joining Data in R with Dplyr Package
In this article, we will be looking at the different methods of joining data with the dplyr in the R programming language. We need to load the dplyr package. Type the below commands - Install - install.packages("dplyr") Load - library("dplyr") Method 1: Using inner join In this method of joining data, the user call the inner_join function, which wi
5 min read
Data Manipulation in R with Dplyr Package
In this article let's discuss manipulating data in the R programming language. In order to manipulate the data, R provides a library called dplyr which consists of many built-in methods to manipulate the data. So to use the data manipulation function, first need to import the dplyr package using library(dplyr) line of code. Below is the list of a f
5 min read
Data Cleaning &amp; Transformation with Dplyr in R
In R, data formatting typically involves preparing and structuring your data in a way that is suitable for analysis or visualization. The exact steps for data formatting may vary depending on your specific dataset and the analysis you want to perform. Here are some common data formatting tasks in the R Programming Language. Formatting data in R is
6 min read
Aggregating and analyzing data with dplyr | R Language
In this article we will we will discuss how we Aggregate and analyze data with dplyr package in the R Programming Language. What is dplyr package in R?The dplyr package is used in R Programming Language to perform simulations in the data by performing manipulations and transformations. It can be installed into the working space using the following
9 min read
Group by one or more variables using Dplyr in R
The group_by() method is used to divide and segregate date based on groups contained within the specific columns. The required column to group by is specified as an argument of this function. It may contain multiple column names. Syntax: group_by(col1, col2, ...) Example 1: Group by one variable C/C++ Code # installing required libraries library(
2 min read