Open In App

Joining of Dataframes in R Programming

Last Updated : 23 May, 2022
Improve
Improve
Like Article
Like
Save
Share
Report

rreIn R Language, dataframes are generic data objects which are used to store the tabular data. Dataframes are considered to be the most popular data objects in R programming because it is more comfortable to analyze the data in the tabular form. Dataframes can also be taught as mattresses where each column of a matrix can be of the different data types. Dataframe is made up of three principal components, the data, rows, and columns. In R we use merge() function to merge two dataframes in R. This function is present inside join() function of dplyr package. The most important condition for joining two dataframes is that the column type should be the same on which the merging happens. merge() function works similarly like join in DBMS. Types of Merging Available in R are,

  1. Natural Join or Inner Join
  2. Left Outer Join
  3. Right Outer Join
  4. Full Outer Join
  5. Cross Join
  6. Semi Join
  7. Anti Join

Basic Syntax of merge() function in R:

Syntax: merge(df1, df2, by.df1, by.df2, all.df1, all.df2, sort = TRUE) Parameters: df1: one dataframe df2: another dataframe by.df1, by.df2: The names of the columns that are common to both df1 and df2. all, all.df1, all.df2: Logical values that actually specify the type of merging happens.

Now let’s try to understand all types of merging one by one. First of all, we will create two dataframes that will help us to understand each join easily. # Data frame 1 

Python3




df1 = data.frame(StudentId = c(101:106),
                 Product = c("Hindi", "English",
                             "Maths", "Science",
                             "Political Science",
                             "Physics"))
df1


Output:

    StudentId          Product
1       101             Hindi
2       102           English
3       103             Maths
4       104           Science
5       105 Political Science
6       106           Physics

# Data frame 2 

Python3




df2 = data.frame(StudentId = c(102, 104, 106,
                               107, 108),
                 State = c("Mangalore", "Mysore",
                           "Pune", "Dehradun", "Delhi"))
df2


Output:

   StudentId   State
1       102 Mangalore
2       104   Mysore
3       106     Pune
4       107 Dehradun
5       108    Delhi

Natural Join or Inner Join

Inner join is used to keep only those rows that are matched from the dataframes, in this, we actually specify the argument all = FALSE. If we try to understand this using set theory then we can say here we are actually performing the intersection operation. For example:

A = [1, 2, 3, 4, 5]
B = [2, 3, 5, 6]
Then the output of natural join will be (2, 3, 5)

It is the most simplest and common type of joins available in R. Now let us try to understand this using R program:

 Example: 

Python3




# R program to illustrate
# Joining of dataframes
 
df = merge(x = df1, y = df2, by = "StudentId")
df


Output:

   StudentId Product    State
1       102 English Mangalore
2       104 Science   Mysore
3       106 Physics     Pune

Left Outer Join

Left Outer Join is basically to include all the rows of your dataframe x and only those from y that match, in this, we actually specify the argument x = TRUE. If we try to understand this using a basic set theory then we can say here we are actually displaying complete set x. Now let us try to understand this using R program:

 Example: 

Python3




# R program to illustrate
# Joining of dataframes
 
df = merge(x = df1, y = df2, by = "StudentId",
                                 all.x = TRUE)
df


Output:

  StudentId           Product    State
1       101             Hindi     NA
2       102           English   Mangalore
3       103             Maths     NA
4       104           Science   Mysore
5       105 Political Science     NA
6       106           Physics     Pune

Right Outer Join

Right, Outer Join is basically to include all the rows of your dataframe y and only those from x that match, in this, we actually specify the argument y = TRUE. If we try to understand this using a basic set theory then we can say here we are actually displaying a complete set y. Now let us try to understand this using R program: Example: 

Python3




# R program to illustrate
# Joining of dataframes
 
df = merge(x = df1, y = df2, by = "StudentId",
                                 all.y = TRUE)
df


Output:

  StudentId Product    State
1       102 English Mangalore
2       104 Science   Mysore
3       106 Physics     Pune
4       107    NA    Dehradun
5       108    NA    Delhi

Full Outer Join

Outer Join is basically used to keep all rows from both dataframes, in this, we actually specify the arguments all = TRUE. If we try to understand this using a basic set theory then we can say here we are actually performing the union option. Now let us try to understand this using R program: 

Example: 

Python3




# R program to illustrate
# Joining of dataframes
 
df = merge(x = df1, y = df2, by = "StudentId",
                                   all = TRUE)
df


Output:

   StudentId           Product    State
1       101             Hindi     NA
2       102           English   Mangalore
3       103             Maths     NA
4       104           Science   Mysore
5       105 Political Science     NA
6       106           Physics     Pune
7       107              NA     Dehradun
8       108              NA     Delhi

Cross Join

A Cross Join also known as cartesian join results in every row of one dataframe is being joined to every other row of another dataframe. In set theory, this type of joins is known as the cartesian product between two sets. Now let us try to understand this using R program: 

Example: 

Python3




# R program to illustrate
# Joining of dataframes
 
df = merge(x = df1, y = df2, by = NULL)
df


Output:

StudentId.x           Product StudentId.y    State
1          101             Hindi         102 Mangalore
2          102           English         102 Mangalore
3          103             Maths         102 Mangalore
4          104           Science         102 Mangalore
5          105 Political Science         102 Mangalore
6          106           Physics         102 Mangalore
7          101             Hindi         104   Mysore
8          102           English         104   Mysore
9          103             Maths         104   Mysore
10         104           Science         104   Mysore
11         105 Political Science         104   Mysore
12         106           Physics         104   Mysore
13         101             Hindi         106     Pune
14         102           English         106     Pune
15         103             Maths         106     Pune
16         104           Science         106     Pune
17         105 Political Science         106     Pune
18         106           Physics         106     Pune
19         101             Hindi         107 Dehradun
20         102           English         107 Dehradun
21         103             Maths         107 Dehradun
22         104           Science         107 Dehradun
23         105 Political Science         107 Dehradun
24         106           Physics         107 Dehradun
25         101             Hindi         108    Delhi
26         102           English         108    Delhi
27         103             Maths         108    Delhi
28         104           Science         108    Delhi
29         105 Political Science         108    Delhi
30         106           Physics         108    Delhi

Semi Join

This join is somewhat like inner join, with only the left dataframe columns and values are selected. Now let us try to understand this using R program: 

Example: 

Python3




# R program to illustrate
# Joining of dataframes
 
# Import required library
library(dplyr)
 
df = df1 %>% semi_join(df2, by = "StudentId")
df


Output:

  StudentId Product
1       102 English
2       104 Science
3       106 Physics

Anti Join

In terms of set theory, we can say anti-join as set difference operation, for example, A = (1, 2, 3, 4) B = (2, 3, 5) then the output of A-B will be set (1, 4). This join is somewhat like df1 – df2, as it basically selects all rows from df1 that are actually not present in df2. Now let us try to understand this using R program:

 Example: 

Python3




# R program to illustrate
# Joining of dataframes
 
# Import required library
library(dplyr)
 
df = df1 %>% anti_join(df2, by = "StudentId")
df


Output:

  StudentId           Product
1       101             Hindi
2       103             Maths
3       105 Political Science


Like Article
Suggest improvement
Previous
Next
Share your thoughts in the comments

Similar Reads