
Iris dataset in R

Last Updated : 01 Mar, 2024

The Iris dataset is a classic dataset often used for learning and practicing data analysis and machine learning techniques. In the R programming language, it is commonly loaded to explore data and to build predictive models.

Iris dataset in R

The Iris dataset comprises measurements of iris flowers from three different species: Setosa, Versicolor, and Virginica. Each sample consists of four features: sepal length, sepal width, petal length, and petal width. Additionally, each sample is labeled with its corresponding species.
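As a quick check of the description above, R's built-in `iris` data frame can be inspected directly. This is a minimal sketch that uses the copy shipped with R's `datasets` package, whose column names (`Sepal.Length` and so on) differ from those in the CSV file used later in this article.

```r
# Built-in copy of the dataset from the 'datasets' package
data(iris)

# Number of rows (samples) and columns (4 measurements plus the species label)
print(dim(iris))

# Column names in the built-in copy
print(names(iris))

# Number of samples per species
print(table(iris$Species))
```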

Dataset Link: Iris Dataset

For visualization in R, we can use base graphics or packages such as ggplot2, together with summary functions such as summary(). Visualizations such as scatter plots, box plots, and histograms help us understand the distribution of each feature and identify potential patterns or outliers.

We start by loading the libraries required for our analysis. These libraries contain functions and tools that we’ll use for data manipulation, visualization, and modeling. We then read the Iris dataset from a CSV file into our R environment. This dataset contains the sepal and petal dimensions of different iris flowers, along with their species.

R




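# Load necessary libraries
library(randomForest) # Random forest models
library(e1071)        # SVM and naive Bayes models
library(class)        # k-nearest neighbours
library(ggplot2)      # Plotting
library(reshape2)     # Reshaping data
library(dplyr)        # For data manipulation, if needed
 
# Set working directory (replace with the folder that contains iris.csv)
setwd("Your/directory/path")
 
# Load the dataset; stringsAsFactors = TRUE reads Species as a factor,
# which R 4.0 and later no longer does by default
iris_data <- read.csv("iris.csv", header = TRUE, stringsAsFactors = TRUE)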
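If `iris.csv` is not at hand, a data frame with a similar layout can be approximated from R's built-in `iris` dataset. The sketch below assumes the CSV layout implied by the output shown in this article (`Id`, `SepalLengthCm`, and so on, with an `Iris-` prefix on species names); the name `iris_data_builtin` is chosen here purely for illustration, and values in the built-in copy may differ slightly from those in a given CSV.

```r
# Build a stand-in for the CSV from the built-in dataset.
# Column names and the "Iris-" prefix are assumptions that mimic the CSV layout.
data(iris)
iris_data_builtin <- data.frame(
  Id            = seq_len(nrow(iris)),
  SepalLengthCm = iris$Sepal.Length,
  SepalWidthCm  = iris$Sepal.Width,
  PetalLengthCm = iris$Petal.Length,
  PetalWidthCm  = iris$Petal.Width,
  Species       = factor(paste0("Iris-", iris$Species))
)

# Inspect the result
str(iris_data_builtin)
```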


Next, we display the first few rows of the dataset to get an overview of its structure and contents. This helps us understand what kind of data we’re working with.

R




# Display the first few rows of the dataset
head(iris_data)


Output:

  Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm     Species
1  1           5.1          3.5           1.4          0.2 Iris-setosa
2  2           4.9          3.0           1.4          0.2 Iris-setosa
3  3           4.7          3.2           1.3          0.2 Iris-setosa
4  4           4.6          3.1           1.5          0.2 Iris-setosa
5  5           5.0          3.6           1.4          0.2 Iris-setosa
6  6           5.4          3.9           1.7          0.4 Iris-setosa

Check the structure of the dataset

R




str(iris_data)


Output:

'data.frame':    150 obs. of  6 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ SepalLengthCm: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ SepalWidthCm : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ PetalLengthCm: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ PetalWidthCm : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...

str(iris_data) shows the structure of the dataset: the name and data type of each variable (column), along with its first few values. It’s particularly useful for understanding the types of variables we’re dealing with, such as numeric, factor, or character.
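When only the column types are needed, `sapply()` gives a more compact view than `str()`. The sketch below uses the built-in `iris` data frame so that it runs without the CSV file; with the CSV loaded, `iris_data` could be substituted.

```r
data(iris)

# Class of each column, returned as a named character vector
column_types <- sapply(iris, class)
print(column_types)

# Names of the numeric columns only
numeric_columns <- names(iris)[sapply(iris, is.numeric)]
print(numeric_columns)
```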

Generate summary statistics for Iris dataset

R




# Summary statistics
summary(iris_data)


Output:

       Id         SepalLengthCm    SepalWidthCm   PetalLengthCm
 Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000
 1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600
 Median : 75.50   Median :5.800   Median :3.000   Median :4.350
 Mean   : 75.50   Mean   :5.843   Mean   :3.054   Mean   :3.759
 3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100
 Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900
  PetalWidthCm              Species
 Min.   :0.100   Iris-setosa    :50
 1st Qu.:0.300   Iris-versicolor:50
 Median :1.300   Iris-virginica :50
 Mean   :1.199
 3rd Qu.:1.800
 Max.   :2.500

summary() reports the minimum, quartiles, median, mean, and maximum of each numeric variable, and the number of samples at each level of the Species factor. These statistics give us insight into the central tendency, dispersion, and distribution of the data.
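Because summary() pools all three species, it can hide differences between them. As a hedged illustration, the sketch below computes per-species means and standard deviations with `aggregate()`, using the built-in `iris` data frame so that it runs without the CSV file.

```r
data(iris)

# Mean of each measurement within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(species_means)

# Standard deviation of each measurement within each species
species_sds <- aggregate(. ~ Species, data = iris, FUN = sd)
print(species_sds)
```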

Data Visualization of Iris dataset in R

R




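# Boxplot of the four measurement columns (Id and Species are excluded)
measurement_cols <- c("SepalLengthCm", "SepalWidthCm",
                      "PetalLengthCm", "PetalWidthCm")
boxplot(iris_data[, measurement_cols],
        col = c("red", "blue", "green", "orange"),
        main = "Boxplot of Iris Dataset")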


Output:

[Figure: boxplot of the Iris dataset]

This code generates a boxplot for each of the four numeric measurements (sepal length, sepal width, petal length, and petal width) in the iris dataset.

  • The col parameter specifies the fill color of each box, one per variable.
  • The main parameter sets the title of the boxplot.
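To compare species rather than variables, one option is the formula interface of `boxplot()`, which draws one box per group. The sketch below is an illustration on the built-in `iris` data frame, so the column is `Petal.Length` rather than `PetalLengthCm`.

```r
data(iris)

# One box per species for a single measurement, using the formula interface
boxplot(Petal.Length ~ Species, data = iris,
        col = c("red", "blue", "green"),
        main = "Petal Length by Species",
        xlab = "Species", ylab = "Petal Length (cm)")
```

Here the three colors do map to the three species, because each box corresponds to one level of `Species`.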

Histogram

R




# Load R's built-in copy of the iris dataset
# (its columns are named Sepal.Length, Petal.Length, and so on)
data(iris)
 
# Create a histogram of petal length
hist(iris$Petal.Length,
     main = "Histogram of Petal Length",
     xlab = "Petal Length",
     col = "skyblue",
     border = "black")


Output:

[Figure: histogram of petal length]

This code generates a histogram of the Petal.Length variable.

  • The main parameter sets the title of the histogram.
  • The xlab parameter sets the label for the x-axis.
  • The col parameter sets the fill color of the bars.
  • The border parameter sets the color of the bar outlines.
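The bins behind a histogram can also be inspected without drawing anything, since `hist()` returns an object containing the bin boundaries and counts. The sketch below is a small illustration of that, followed by a redraw with a different suggested number of bins through the `breaks` argument.

```r
data(iris)

# Compute the histogram without drawing it
h <- hist(iris$Petal.Length, plot = FALSE)

# Bin boundaries and the number of observations in each bin
print(h$breaks)
print(h$counts)

# Redraw with a larger suggested number of bins
hist(iris$Petal.Length, breaks = 20,
     main = "Histogram of Petal Length (more bins)",
     xlab = "Petal Length", col = "skyblue", border = "black")
```

Note that `breaks = 20` is treated by R as a suggestion, so the number of bins actually drawn may differ.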

Heatmap

Here we compute the correlation matrix of the numeric columns of the iris dataset and plot it with the `heatmap()` function to visualize how strongly each pair of measurements is related.

R




# Subset the data to include only numeric columns
numeric_data <- iris_data[, sapply(iris_data, is.numeric)]
 
# Drop the Id column, which is a row identifier rather than a measurement
numeric_data <- numeric_data[, setdiff(names(numeric_data), "Id")]
 
# Calculate the correlation matrix
correlation_matrix <- cor(numeric_data)
 
# Create a heatmap of the correlation matrix
heatmap(correlation_matrix,
        main = "Heatmap of Correlation Matrix",
        xlab = "Variables",
        ylab = "Variables",
        col = heat.colors(12),
        symm = TRUE)


Output:

[Figure: heatmap of the correlation matrix]

First we calculate the correlation matrix for the numeric variables in the iris dataset, and then we create a heatmap of it using the heatmap() function. The color scale is set using the col parameter, and symm = TRUE tells heatmap() to treat the matrix as symmetric.
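A heatmap shows relative strength by color, but the exact coefficients are easier to read from the matrix itself. As a minimal sketch on the built-in `iris` data frame (columns 1 to 4 are the measurements there), the correlations can be printed in rounded form:

```r
data(iris)

# Correlation matrix of the four measurements, rounded for readability
correlation_matrix <- cor(iris[, 1:4])
print(round(correlation_matrix, 2))
```

In the built-in copy, petal length and petal width are strongly positively correlated, which is the kind of relationship the heatmap is meant to make visible.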

Pairplot

The pairs() function produces a pairplot showing pairwise scatterplots of the variables (Sepal Length, Sepal Width, Petal Length, Petal Width) against each other, with points colored by species. Several relationships are visible in this plot; for example, the species Setosa has the smallest petal widths and lengths. Similar observations can be made about the other species.

R




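# Create a pairplot of the four measurement columns
pairs(iris_data[, c("SepalLengthCm", "SepalWidthCm",
                    "PetalLengthCm", "PetalWidthCm")],
      main = "Pairplot of Iris Dataset",
      pch = 19, # Solid circle as the point character
      col = as.integer(factor(iris_data$Species))) # One color per species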


Output:

[Figure: pairplot of the Iris dataset]

Histogram with Distplot

Histogram: The histogram represents the distribution of the PetalLengthCm variable. It divides the range of values into intervals (bins) and uses bars to display the density of observations falling into each bin. This helps visualize the distribution of petal lengths in the dataset and provides insights into the range and frequency of different petal lengths.

Distribution Plot Overlay: The distribution plot overlay, shown in orange, provides a smoothed estimate of the probability density function (PDF) of the PetalLengthCm variable. It offers additional information about the shape and central tendency of the distribution beyond what the histogram provides. The density plot is a smoothed version of the histogram and gives a sense of the underlying probability distribution of the data.

R




# Load the necessary library
library(ggplot2)
 
# Create a histogram with a distribution plot overlay
# after_stat(density) replaces the deprecated ..density.. notation
# (requires ggplot2 3.3.0 or later)
ggplot(iris_data, aes(x = PetalLengthCm)) +
  geom_histogram(aes(y = after_stat(density)), fill = "skyblue", color = "black", bins = 30) +
  geom_density(alpha = 0.7, fill = "orange") +
  labs(title = "Histogram with Distribution Plot Overlay",
       x = "Petal Length",
       y = "Density")


Output:

[Figure: histogram of petal length with density overlay]

Conclusion

Starting with the dataset’s structure and features, we’ve seen how to load the data, inspect it with head() and str(), and summarize it with summary(). Then, through visualizations such as box plots, histograms, a correlation heatmap, and a pairplot, we’ve uncovered patterns and relationships in the data.

Overall, the Iris dataset serves as a great learning resource for beginners in data science. It’s straightforward yet offers plenty of opportunities to practice essential skills like data manipulation, visualization, and analysis.


