Open In App

Pair plot from scratch with tidyverse

Last Updated : 24 Aug, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

In R Programming Language Pairplot is a matrix of plots that are used to show the relationship between each of the pairs of variables in a given dataset.
The pairs() function in R is used to create pair plots specifically scatter plots, the syntax is:-

pairs(dataset)

where dataset parameter is the name of the data frame. The function returns a matrix of scatter plots between each pair of variables in the data frame.

Pairplots can be used to get a sense of the distribution of different variables of the dataset and helps us to identify any potential problems or patterns in a given dataset.

Pair Plot Using Tidyverse in R

Tidyverse package in R is a collection of R packages designed for Exploratory data analysis, visualization, and manipulation in R. It contains R packages namely ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats and many others. The complete list can be obtained as shown in the below code.

R




# Install the tidyverse package
  install.packages("tidyverse")
 
# Load the package
  library(tidyverse)
 
# List all packages in tidyverse
  tidyverse_packages()


Output:

 [1] "broom"         "conflicted"    "cli"           "dbplyr"        "dplyr"        
 [6] "dtplyr"        "forcats"       "ggplot2"       "googledrive"   "googlesheets4"
[11] "haven"         "hms"           "httr"          "jsonlite"      "lubridate"    
[16] "magrittr"      "modelr"        "pillar"        "purrr"         "ragg"         
[21] "readr"         "readxl"        "reprex"        "rlang"         "rstudioapi"   
[26] "rvest"         "stringr"       "tibble"        "tidyr"         "xml2"         
[31] "tidyverse" 

So tidyverse package actually initializes all the needed packages at once instead of having to initialize them in our code one by one which is time taking and inefficient.

Let’s assume a preloaded dataset in R, the chickwts dataset preloaded in R which contains the weight of chickens according to their feeding habit.

R




# Printing the available columns in dataset
print(colnames(chickwts))
 
# Total count of columns
print(paste("Total no. of columns : ", length(colnames(chickwts))))


Output :

[1] "weight" "feed"
[1] "Total no. of columns :  2"

Hence total no. of pair plots created will be 2 if we perform a simple pair-plot using the simple pairs() function in R.

R




# Simple scatter pair plot
 pairs(chickwts)


Output:

Screenshot-(639)

Simple scatter pair-plot

Now we will try to perform the same using the tidyverse packages to get a more clean look to the plot.

Step 1: Preparing the Dataset

First, we need to clean our dataset of NA(Not Available) values by removing the rows containing NA in any of the columns. We will use the drop_na() of the tidy package to achieve the same.

R




library(tidyverse)
 
# Storing our chickwts dataset in a different variable
 chickwts_df <- chickwts
 
# Checking for existence of NA values and removing these rows
if (sum(is.na(chickwts_df)) > 0) {
  chickwts_df <- chickwts_df %>% drop_na()
  print("Cleaned the dataset of NA values")
} else {
  print("No NA values found.")
}


Output:

[1] "No NA values found."

The is. na() function produces a matrix consisting of logical values (TRUE or FALSE), where TRUE indicates a missing value in a particular dataset column. The sum() function is used to take a sum of all those logical values and since TRUE corresponds to 1 any missing value will result in a sum > 0. We finally used the drop_na() function to clean our dataset. `%>%` is the pipe operator of dplyr package included in tidyverse to pipe two or more operations together.

Step 2: Create the pair plot

We will use the GGally package which extends the ggplot2 package and provides a function named ggpairs() which is the ggplot2 equivalent of the pairs() the function of base R.  We can see the correlation coefficient between each pairwise combination of variables as well as a density plot for each individual variable and hence its better than pair() function.

R




library(tidyverse)
library(GGally)
 
# Storing our chickwts dataset in a different variable
chickwts_df <- chickwts
 
# Checking for existence of NA values and removing these rows
if (sum(is.na(chickwts_df)) > 0) {
  chickwts_df <- chickwts_df %>% drop_na()
  print("Cleaned the dataset of NA values")
} else {
  print("No NA values found.")
}
 
# pair plot using ggpairs
ggpairs(chickwts_df, columns = c("weight", "feed"))


Output:

ing

Pair plot from scratch with tidyverse

Here inside ggpairs() function we specify the dataset we want to create a pair plot of and the columns to be used in the pair-plot.

It has generated 4 plots namely scatter plot(top left), box plot (top-right), density plot(bottom-left), and histogram(bottom-right).

Now let’s improve the looks
first, we will add colors to the plot by setting the color parameter in the aesthetics of our plot to the `feed` column and then add the `alpha = 0.5` parameter to denote a semi-transparent plot to get a better view of overlapping layers.

R




library(tidyverse)
library(GGally)
 
# Storing our chickwts dataset in a different variable
chickwts_df <- chickwts
 
# Checking for existence of NA values and removing these rows
if (sum(is.na(chickwts_df)) > 0) {
  chickwts_df <- chickwts_df %>% drop_na()
  print("Cleaned the dataset of NA values")
} else {
  print("No NA values found.")
}
 
# pair plot using ggpairs
ggpairs(chickwts_df, columns = c("weight", "feed"), aes(color = feed, alpha = 0.5))


Output :

Screenshot-(641)

Pair Plot from scratch with tidyverse

Now finally we will add a legend to our plot and set a theme to the plot.

R




library(tidyverse)
library(GGally)
 
# Storing our chickwts dataset in a different variable
chickwts_df <- chickwts
 
# pair plot using ggpairs
ggpairs(chickwts_df, columns = c("weight", "feed"), aes(color = feed,
              alpha = 0.5), legend = 1) + theme_bw()


Output:

ing

Pair Plot from scratch with tidyverse

Here we added a legend by writing `legend = 1` which enables legend to be displayed, the default value is NULL which means no legend. The plot displays a legend containing all parameters mentioned in aes() function. We set the plot to the theme_bw() function to get a simple theme with deep black borders around the plot.

Example 2:

Now that we have created a basic pairplot using tidyverse, let’s try to plot another dataset. We will use the preloaded dataset in R `air quality` containing New York’s Air quality measurements.
You are viewing
Step 1:-

We are viewing the dataset and checking for NA values in the dataset.

R




# Viewing the first 10 rows of the airquality dataset
 head(airquality, 10)
 
# Printing the total number of rows in the dataset
 rows <- nrow(airquality)
 cat("Total number of rows in the dataset : ", rows, "\n")
 
# Print all the different column names in dataset
 col_names <- names(airquality)
 cat("Column names : ", paste(col_names, collapse = ", "), "\n")
 
# Print the total number of rows containing NA values
 na_rows <- sum(rowSums(is.na(airquality)) > 0)
 cat("Number of rows containing NA values : ", na_rows, "\n")


Output :

Total number of rows in the dataset :  153 

Column names :  Ozone, Solar.R, Wind, Temp, Month, Day 

Number of rows containing NA values :  42 

In the above head() function we have displayed NA values and even some rows contain NA values in multiple columns, which has been confirmed by counting such later.
In counting the number of rows containing at least a NA value, `rowSums(is.na(airquality)` calculates the sum of NA values for each row and stores it in a numeric vector where each element represents the number of NA values in each row. The numeric vector for the air quality dataset will look like this.

Step 2:-

Cleaning the dataset by removing rows containing NA values using the drop_na() function of tidyr package.

R




library(tidyverse)
 
# Remove rows with any NA values and store it in a new variable
 airquality_cleaned <- airquality %>% drop_na()
 
# Print the total number of rows in the dataset
 total_rows <- nrow(airquality_cleaned)
 cat("Total number of rows in the dataset after cleaning : ", total_rows, "\n")
 
# Rechecking for NA values for clarity
 na_rows <- sum(rowSums(is.na(airquality_cleaned)) > 0)
 cat("Number of rows containing NA values now : ", na_rows, "\n")


Output :

Total number of rows in the dataset after cleaning :  111

Number of rows containing NA values now :  0

The output indeed matches our normal math calculations(i.e. 153 – 42 = 111). Hence we have completely cleaned our dataframe.

Step 3:-

Now we need to plot the cleaned data frame. We will assume all columns except the Day column.

Method 1: Using simple pairs() function

R




library(tidyverse)
 
# Remove rows with any NA values and store it in a new variable
airquality_cleaned <- airquality %>% drop_na()
 
# Create a colorful scatter pair-plot
pairs(airquality_cleaned[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")],
      col = airquality_cleaned$Month, pch=20, oma=c(3,3,3,15))
 
# Plot the legend outside of default plotting area
par(xpd = TRUE)
 
# Set up legend on right white space of plot
legend("right", legend = unique(airquality_cleaned$Month),
       col = unique(airquality_cleaned$Month), pch = 20, title = "Month")


Output:

Screenshot-(644)

scatter pair-plot of air quality dataframe

In the above code, in the first parameter of the pairs() function, we have selected all rows of cleaned airquality dataframe with only 5 columns mentioning them in the air-quality array. The col parameter is used to set colors to the plot, here we have set it the Month’s value. `pch` stands for the plot character, by default it’s an empty circle(i.e. value = 1), we set it to 20 which means a bullet (smaller filled circle). oma parameter sets the outer margins of a plot(bottom, left, top, and right respectively) and will help us in making space for our legend on the right.

The function par() containing `xpd = TRUE` ensures that legend is plotted outside the default plotting area avoiding overlapping the legend on the plot.

In the legend() function, we curate the looks and aesthetics of our legend. It is positioned to the right side of our plot, and will contain unique Month values(passed to the legend parameter), colors for each such value will be according to unique values of Month(passed to the col parameter), set the marker as bullets using pch parameter and setting the title of our legend to “Month”.

Method 2: Using ggpairs() function

Now we will try to perform a similar plot with the more advanced function ggpairs() and see the results.

R




library(tidyverse)
library(GGally)
 
# Remove rows with any NA values and store it in a new variable
airquality_cleaned <- airquality %>% drop_na()
 
# Convert the Month column into a factor to avoid error while plot
airquality_cleaned$Month <- factor(airquality_cleaned$Month)
 
# Create a colorful pair plot
ggpairs(airquality_cleaned[, c("Ozone", "Solar.R", "Wind", "Temp", "Month")],
        mapping = aes(color = Month, alpha = 0.5), legend = 1)


Output:

ing

Pair Plot from scratch with tidyverse

Similar to the pairs() function implementation we pass the cleaned air quality dataframe in the same format. In the mapping parameter, we set the color to vary with the Months column and set alpha to 0.5 meaning that colors will be semi-transparent so as to view overlapping layers. Finally, we set the legend to be visible in the plot.



Like Article
Suggest improvement
Share your thoughts in the comments

Similar Reads