Open In App

How to merge data in R using R merge, dplyr, or data.table

Last Updated : 24 Aug, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

R is a programming language and a software environment that is primarily used for statistic computing and graphics. It provides a wide range of tools for data manipulation, data analysis, data visualization, and statistical modeling.

Merging data is a task in data analysis and data manipulation. In R there are various ways to merge data frames, using the ‘merge()’ function from base R, using the ‘dplyr’ package, and the ‘data.table’ package. In this tutorial, we will use the above three ways to merge data using R.

1. Using ‘merge()’ from base R:

The merge() function in base R helps us to combine two or more data frames based on common columns. It performs various types of joins such as inner join, left join, right join, and full join.

Syntax:

merged_df <- merge(x,y,by = "common_column",..)
  • ‘x’ and ‘y’ are the data frames that you want to merge.
  • ‘by’ specifies the common columns on which the merge will be performed.
  • Additional arguments like ‘all.x’,all.y’ and ‘all’ control the type of join that is to be performed.

Example:

Consider two data frames ‘df1’ and ‘df2’

R




df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
 
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))


Let’s see examples of performing various types of joins using the ‘merge()’ function:

1. Inner join (default behavior):

R




inner_join <- merge(df1, df2, by = "ID")
print(inner_join)


Output:

  ID Name Age Occupation Salary
1  2    B  30   Engineer   5000
2  3    C  35    Teacher   4000
3  4    D  40     Doctor   6000

The resulting ‘inner_join’ dataframe will only include the common rows where ‘ID’ is present in both ‘df1’ and ‘df2’.

2. Left join(‘all.x=TRUE’):

R




left_join <- merge(df1, df2, by = "ID", all.x = TRUE)
print (left_join)


Output:

   ID Name Age Occupation Salary
1  1    A  25       <NA>     NA
2  2    B  30   Engineer   5000
3  3    C  35    Teacher   4000
4  4    D  40     Doctor   6000

The resulting ‘left_join’ data frame will include all rows from ‘df1’ and the matching rows from ‘df2’. Non-matching rows from ‘df2’ will have an ‘NA’ value

3. Right join (‘all.y=TRUE’):

R




right_join <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_join)


Output:

  ID Name Age Occupation Salary
1  2    B  30   Engineer   5000
2  3    C  35    Teacher   4000
3  4    D  40     Doctor   6000
4  5 <NA>  NA     Lawyer   7000

The resulting ‘right_join’ data frame will include all rows from ‘df2’ and the matching rows from ‘df1’. Non-matching rows from ‘df1’ will have ‘NA’ values.

4. Full outer join(‘all =TRUE’)

R




full_join <- merge(df1, df2, by = "ID", all = TRUE)
print(full_join)


Output:

ID Name Age Occupation Salary
1  1    A  25       <NA>     NA
2  2    B  30   Engineer   5000
3  3    C  35    Teacher   4000
4  4    D  40     Doctor   6000
5  5 <NA>  NA     Lawyer   7000

The resulting ‘full_join’ data frame will include all rows from both ‘df1’ and ‘df2’. Non-matching values will have ‘NA’ values.

2. Using ‘dplyr’ package:

The ‘dplyr’ package provides a set of functions for data manipulation, including merging data frames.

The primary function for merging in ‘dplyr’ is ‘join()’, which supports various types of joins.

Syntax:

merged_df<- join(x,y,by="common_column",type="type_of_join")
  • ‘x’ and ‘y’ are the data frames to be merged.
  • ‘by’ specifies the common columns on which the merge is to be performed
  • ‘type_of_join’ can be ‘inner’, ‘left’,’ right’ or ‘full’ to specify the type of join.

Example:
Install the dplyr() package and create two data frames, ‘df1’ and ‘df2’.

R




library(dplyr)
 
df1 <- data.frame(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(20, 30, 40, 50))
 
df2 <- data.frame(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(2000, 4000, 6000, 7000))


Let’s see examples of performing various types of joins using the ‘dplyr’ functions:

1. Inner join:

R




inner_join <- inner_join(df1, df2, by = "ID")
print(inner_join)


Output:

 ID Name Age Occupation Salary
1  2    B  30   Engineer   2000
2  3    C  40    Teacher   4000
3  4    D  50     Doctor   6000

The resulting ‘inner_join’ data frame will only include the common rows where ‘ID’ is present in both ‘df1’ and ‘df2’.

2. Left join:

R




left_join <- left_join(df1, df2, by = "ID")
print(left_join)


Output:

ID Name Age Occupation Salary
1  1    A  20       <NA>     NA
2  2    B  30   Engineer   2000
3  3    C  40    Teacher   4000
4  4    D  50     Doctor   6000

The resulting ‘left_join’ data frame will include all rows from ‘df1’ and the matching rows from ‘df2’. Non-matching rows from ‘df2’ will have ‘NA’ values.

3. Right join:

R




right_join <- right_join(df1, df2, by = "ID")
print(right_join)


Output:

ID Name Age Occupation Salary
1  2    B  30   Engineer   2000
2  3    C  40    Teacher   4000
3  4    D  50     Doctor   6000
4  5 <NA>  NA     Lawyer   7000

The resulting ‘right_join’ dataframe will include all rows from ‘df2’ and the matching rows from ‘df1’. Non-matching rows of ‘df1’ will have ‘NA’ values.

4. Full outer join:

R




full_join <- full_join(df1, df2, by = "ID")
print(full_join)


Output:

 ID Name Age Occupation Salary
1  1    A  20       <NA>     NA
2  2    B  30   Engineer   2000
3  3    C  40    Teacher   4000
4  4    D  50     Doctor   6000
5  5 <NA>  NA     Lawyer   7000

The resulting ‘full_join’ data frame will include all rows from both ‘df1’ and ‘df2’. Non- matching rows will have ‘NA’ values.

3. Using ‘data.table’ package:

The ‘data.table’ package offers an efficient and fast approach to data manipulation. It provides the ‘merge()’ function. It is similar to the one in base R but optimized for speed.

Syntax:

merged_dt <- merge(x, y, by = "common_column", ...)
  • ‘x’ and ‘y’ are the data frames that are to be merged.
  • ‘by’ specifies the common columns on which the merge will be performed.
  • Additional arguments like ‘all.x’, ‘all.y’ and ‘all’ that controls the type of join.

Example:

Install the data.table library and create two data tables, ‘dt1’ and ‘dt2’.

R




library(data.table)
dt1 <- data.table(ID = c(1, 2, 3, 4),
                  Name = c("A", "B", "C", "D"),
                  Age = c(25, 30, 35, 40))
dt2 <- data.table(ID = c(2, 3, 4, 5),
                  Occupation = c("Engineer", "Teacher", "Doctor", "Lawyer"),
                  Salary = c(5000, 4000, 6000, 7000))


Let’s see examples of performing various types of merges using the ‘merge()’ function from ‘data.table’ package:

1. Inner join( default behaviour):

R




inner_join <- merge(df1, df2, by = "ID")
print(inner_join)


Output:

 ID Name Age Occupation Salary
1  2    B  30   Engineer   2000
2  3    C  40    Teacher   4000
3  4    D  50     Doctor   6000

The resulting ‘inner_join’ data frame will only include the common rows where ‘ID’ is present in both ‘df1’ and ‘df2’.

2. Left join( ‘all.x = TRUE’):

R




left_join <- merge(df1, df2, by = "ID", all.x = TRUE)
print(left_join)


Output:

 ID Name Age Occupation Salary
1  1    A  20       <NA>     NA
2  2    B  30   Engineer   2000
3  3    C  40    Teacher   4000
4  4    D  50     Doctor   6000

The resulting ‘left_join’ data frame will include all Non-matching from ‘df1’ and the matching rows from ‘df2’. Non-matching rows from ‘df2’ will have ‘NA’ values.

3. Right join( ‘all.y = TRUE’):

R




right_join <- merge(df1, df2, by = "ID", all.y = TRUE)
print(right_join)


Output:

  ID Name Age Occupation Salary
1  2    B  30   Engineer   2000
2  3    C  40    Teacher   4000
3  4    D  50     Doctor   6000
4  5 <NA>  NA     Lawyer   7000

The resulting ‘right_join’ data frame will include all Non-matching rows from ‘df2’ and the matching rows from ‘df1’. Non- matching rows from ‘df1’ will have ‘NA’ values.

3. Full outer join( ‘all = TRUE’):

R




full_join <- merge(df1, df2, by = "ID", all = TRUE)
print(full_join)


Output:

 ID Name Age Occupation Salary
1  1    A  20       <NA>     NA
2  2    B  30   Engineer   2000
3  3    C  40    Teacher   4000
4  4    D  50     Doctor   6000
5  5 <NA>  NA     Lawyer   7000

The resulting ‘full_join’ data frame will include all Non-matching rows from both ‘df1’ and ‘df2’. Non- matching rows will have ‘NA’ values.

One of the advantages of using ‘dplyr’ and ‘data.table’ is that they provide a more concise and readable syntax for data manipulation compared to base R.

Summary:

  • ‘merge()’ provides a general-purpose function for merging data frames.
  • ‘dplyr’ focuses on merging with a focus on readability and ease of use.
  • ‘data. table’ provides a fast and efficient way to handle large datasets with optimized performance.
  • We should choose the approach based on the complexity of the task, and performing requirements.


Like Article
Suggest improvement
Share your thoughts in the comments

Similar Reads