Joining of Dataframes in R Programming
Last Updated :
23 May, 2022
rreIn R Language, dataframes are generic data objects which are used to store the tabular data. Dataframes are considered to be the most popular data objects in R programming because it is more comfortable to analyze the data in the tabular form. Dataframes can also be taught as mattresses where each column of a matrix can be of the different data types. Dataframe is made up of three principal components, the data, rows, and columns. In R we use merge() function to merge two dataframes in R. This function is present inside join() function of dplyr package. The most important condition for joining two dataframes is that the column type should be the same on which the merging happens. merge() function works similarly like join in DBMS. Types of Merging Available in R are,
- Natural Join or Inner Join
- Left Outer Join
- Right Outer Join
- Full Outer Join
- Cross Join
- Semi Join
- Anti Join
Basic Syntax of merge() function in R:
Syntax: merge(df1, df2, by.df1, by.df2, all.df1, all.df2, sort = TRUE) Parameters: df1: one dataframe df2: another dataframe by.df1, by.df2: The names of the columns that are common to both df1 and df2. all, all.df1, all.df2: Logical values that actually specify the type of merging happens.
Now let’s try to understand all types of merging one by one. First of all, we will create two dataframes that will help us to understand each join easily. # Data frame 1
Python3
df1 = data.frame(StudentId = c( 101 : 106 ),
Product = c("Hindi", "English",
"Maths", "Science",
"Political Science",
"Physics"))
df1
|
Output:
StudentId Product
1 101 Hindi
2 102 English
3 103 Maths
4 104 Science
5 105 Political Science
6 106 Physics
# Data frame 2
Python3
df2 = data.frame(StudentId = c( 102 , 104 , 106 ,
107 , 108 ),
State = c("Mangalore", "Mysore",
"Pune", "Dehradun", "Delhi"))
df2
|
Output:
StudentId State
1 102 Mangalore
2 104 Mysore
3 106 Pune
4 107 Dehradun
5 108 Delhi
Natural Join or Inner Join
Inner join is used to keep only those rows that are matched from the dataframes, in this, we actually specify the argument all = FALSE. If we try to understand this using set theory then we can say here we are actually performing the intersection operation. For example:
A = [1, 2, 3, 4, 5]
B = [2, 3, 5, 6]
Then the output of natural join will be (2, 3, 5)
It is the most simplest and common type of joins available in R. Now let us try to understand this using R program:
Example:
Python3
df = merge(x = df1, y = df2, by = "StudentId")
df
|
Output:
StudentId Product State
1 102 English Mangalore
2 104 Science Mysore
3 106 Physics Pune
Left Outer Join
Left Outer Join is basically to include all the rows of your dataframe x and only those from y that match, in this, we actually specify the argument x = TRUE. If we try to understand this using a basic set theory then we can say here we are actually displaying complete set x. Now let us try to understand this using R program:
Example:
Python3
df = merge(x = df1, y = df2, by = "StudentId",
all .x = TRUE)
df
|
Output:
StudentId Product State
1 101 Hindi NA
2 102 English Mangalore
3 103 Maths NA
4 104 Science Mysore
5 105 Political Science NA
6 106 Physics Pune
Right Outer Join
Right, Outer Join is basically to include all the rows of your dataframe y and only those from x that match, in this, we actually specify the argument y = TRUE. If we try to understand this using a basic set theory then we can say here we are actually displaying a complete set y. Now let us try to understand this using R program: Example:
Python3
df = merge(x = df1, y = df2, by = "StudentId",
all .y = TRUE)
df
|
Output:
StudentId Product State
1 102 English Mangalore
2 104 Science Mysore
3 106 Physics Pune
4 107 NA Dehradun
5 108 NA Delhi
Full Outer Join
Outer Join is basically used to keep all rows from both dataframes, in this, we actually specify the arguments all = TRUE. If we try to understand this using a basic set theory then we can say here we are actually performing the union option. Now let us try to understand this using R program:
Example:
Python3
df = merge(x = df1, y = df2, by = "StudentId",
all = TRUE)
df
|
Output:
StudentId Product State
1 101 Hindi NA
2 102 English Mangalore
3 103 Maths NA
4 104 Science Mysore
5 105 Political Science NA
6 106 Physics Pune
7 107 NA Dehradun
8 108 NA Delhi
Cross Join
A Cross Join also known as cartesian join results in every row of one dataframe is being joined to every other row of another dataframe. In set theory, this type of joins is known as the cartesian product between two sets. Now let us try to understand this using R program:
Example:
Python3
df = merge(x = df1, y = df2, by = NULL)
df
|
Output:
StudentId.x Product StudentId.y State
1 101 Hindi 102 Mangalore
2 102 English 102 Mangalore
3 103 Maths 102 Mangalore
4 104 Science 102 Mangalore
5 105 Political Science 102 Mangalore
6 106 Physics 102 Mangalore
7 101 Hindi 104 Mysore
8 102 English 104 Mysore
9 103 Maths 104 Mysore
10 104 Science 104 Mysore
11 105 Political Science 104 Mysore
12 106 Physics 104 Mysore
13 101 Hindi 106 Pune
14 102 English 106 Pune
15 103 Maths 106 Pune
16 104 Science 106 Pune
17 105 Political Science 106 Pune
18 106 Physics 106 Pune
19 101 Hindi 107 Dehradun
20 102 English 107 Dehradun
21 103 Maths 107 Dehradun
22 104 Science 107 Dehradun
23 105 Political Science 107 Dehradun
24 106 Physics 107 Dehradun
25 101 Hindi 108 Delhi
26 102 English 108 Delhi
27 103 Maths 108 Delhi
28 104 Science 108 Delhi
29 105 Political Science 108 Delhi
30 106 Physics 108 Delhi
Semi Join
This join is somewhat like inner join, with only the left dataframe columns and values are selected. Now let us try to understand this using R program:
Example:
Python3
library(dplyr)
df = df1 % > % semi_join(df2, by = "StudentId")
df
|
Output:
StudentId Product
1 102 English
2 104 Science
3 106 Physics
Anti Join
In terms of set theory, we can say anti-join as set difference operation, for example, A = (1, 2, 3, 4) B = (2, 3, 5) then the output of A-B will be set (1, 4). This join is somewhat like df1 – df2, as it basically selects all rows from df1 that are actually not present in df2. Now let us try to understand this using R program:
Example:
Python3
library(dplyr)
df = df1 % > % anti_join(df2, by = "StudentId")
df
|
Output:
StudentId Product
1 101 Hindi
2 103 Maths
3 105 Political Science
Like Article
Suggest improvement
Share your thoughts in the comments
Please Login to comment...