How to Apply the Empirical Rule in R

The empirical rule is a fundamental concept of statistics. It states that for a normal distribution, approximately 68% of the data is within one standard deviation of the mean, approximately 95% is within two standard deviations, and approximately 99.7% is within three standard deviations. This rule is very useful for data analysis, and it is easy to apply in the R programming language. In this article, we will discuss how to apply the empirical rule in R and work through a few examples.

Empirical Rule

Before applying the empirical rule in R, it helps to understand the concepts behind it. A normal distribution is a symmetric, bell-shaped distribution in which the mean, median, and mode are all equal. Most of the data points are clustered around the mean, and they become less frequent the further away from the mean you get. The standard deviation is a measure of how spread out the data is in relation to the mean. It is calculated by taking the square root of the variance, which is the average of the squared distances from each data point to the mean (for a sample, R's var() and sd() functions divide the sum of squared distances by n - 1 rather than n). To understand the Empirical Rule in R Programming Language, it is also important to understand the CDF.
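Before moving on to the CDF, the standard deviation calculation described above can be sketched by hand and compared with R's built-in functions. This is only an illustration: the variable names (x, x_bar, s2, s) are chosen for this example, and the data values are the ones used in the examples in this article.

```r
# Sample data (same values as the examples in this article)
x <- c(3, 5, 6, 7, 8, 11, 12, 14, 15, 16)

n <- length(x)
x_bar <- sum(x) / n                   # mean
s2 <- sum((x - x_bar)^2) / (n - 1)    # sample variance (divides by n - 1)
s <- sqrt(s2)                         # standard deviation

# The hand-computed values should agree with var() and sd()
cat("Mean:", x_bar, "\n")
cat("Variance by hand:", s2, "| var():", var(x), "\n")
cat("SD by hand:", s, "| sd():", sd(x), "\n")
```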

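The CDF (Cumulative Distribution Function) returns the probability that a random variable will take on a value less than or equal to x. In R, the CDF of the normal distribution is provided by the pnorm() function. The normal distribution with a mean of 0 and a standard deviation of 1 is often referred to as the standard normal distribution, and it is the default that pnorm() uses when no mean and sd arguments are given.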
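As a small sketch of how the CDF behaves, the calls below evaluate pnorm() for the standard normal distribution and for a normal distribution with an illustrative mean of 10 and standard deviation of 2 (values assumed for this example only). It also shows that converting to a z-score first gives the same probability as passing mean and sd directly.

```r
# P(Z <= 0) for the standard normal: half of the distribution lies below the mean
p_zero <- pnorm(0)
cat("P(Z <= 0):", p_zero, "\n")   # 0.5

# P(Z <= 1.96) is close to 0.975
p_196 <- pnorm(1.96)
cat("P(Z <= 1.96):", p_196, "\n")

# For a non-standard normal, either pass mean and sd ...
p_x <- pnorm(12, mean = 10, sd = 2)
# ... or convert to a z-score and use the standard normal
p_z <- pnorm((12 - 10) / 2)
cat("Same probability both ways:", isTRUE(all.equal(p_x, p_z)), "\n")
```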

R




# Sample data
data <- c(3, 5, 6, 7, 8,
          11, 12, 14, 15, 16)
  
# Mean and standard deviation of the data
mean <- mean(data)
sd <- sd(data)
  
# Probability within 1 SD
z_lower <- (mean - sd - mean) / sd
z_upper <- (mean + sd - mean) / sd
p_1_sd <- pnorm(z_upper) - pnorm(z_lower)
cat('Probability within 1 standard deviation:',
    p_1_sd, '\n')


Output:

Probability within 1 standard deviation: 0.6826895  

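This code calculates the mean and standard deviation of a sample data set, converts the bounds mean - sd and mean + sd to z-scores, and then uses the pnorm() function to calculate the probability that a normally distributed value falls within 1 standard deviation of the mean. Because the z-scores always work out to -1 and 1, the result is the theoretical 68.27% for any normal distribution and does not depend on the particular data values.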

R




# Sample data
data <- c(3, 5, 6, 7, 8, 11,
          12, 14, 15, 16)
  
# Mean and standard deviation of the data
mean <- mean(data)
sd <- sd(data)
  
# Probability within 2 SD
z_lower <- (mean - 2*sd - mean) / sd
z_upper <- (mean + 2*sd - mean) / sd
p_2_sd <- pnorm(z_upper) - pnorm(z_lower)
cat('Probability within 2 standard deviations:',
    p_2_sd, '\n')


Output:

Probability within 2 standard deviations: 0.9544997

This code computes the probability of the data falling within 2 standard deviations from the mean. It starts by defining a sample data set and then calculates the mean and standard deviation of the data. Next, it uses these values to compute the lower and upper bounds of 2 standard deviations from the mean. Finally, it uses the pnorm() function to calculate the probability of the data falling within these bounds.

R




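# Sample data
data <- c(3, 5, 6, 7, 8, 11,
          12, 14, 15, 16)
  
# Mean and standard deviation of the data
mean <- mean(data)
sd <- sd(data)
  
# Probability within 3 SD: evaluate the CDF of a normal distribution
# with this mean and sd at the lower and upper bounds
cdf_3_sd_low <- pnorm(mean - 3 * sd, mean = mean, sd = sd)
cdf_3_sd_high <- pnorm(mean + 3 * sd, mean = mean, sd = sd)
p_3_sd <- cdf_3_sd_high - cdf_3_sd_low
  
cat('Probability within 3 standard deviations:',
    p_3_sd, '\n')


Output:

Probability within 3 standard deviations: 0.9973002

This code calculates the mean and standard deviation of the data and then evaluates the normal cumulative distribution function at three standard deviations below and above the mean. The bounds are on the scale of the data rather than z-scores, so the mean and sd arguments must be passed to pnorm(); without them, pnorm() would treat the bounds as values from a standard normal distribution and return a misleading result. The difference between the two CDF values is the probability that a normally distributed value falls within three standard deviations of the mean.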
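The three calculations above differ only in the number of standard deviations, so they can be condensed into one vectorized call. The sketch below works directly with z-scores on the standard normal distribution; the names k and p_within are chosen for this example.

```r
# Number of standard deviations on each side of the mean
k <- 1:3

# pnorm() is vectorized, so all three probabilities are computed at once
p_within <- pnorm(k) - pnorm(-k)
names(p_within) <- paste(k, "SD")

# Approximately 0.6827, 0.9545 and 0.9973
print(round(p_within, 4))
```

These are the exact values that the empirical rule rounds to 68%, 95%, and 99.7%.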

Calculate the Mean and Standard Deviation of the Data

The first step is to calculate the mean and standard deviation of the data set. R offers functions for this, such as mean() and sd() (and median(), if you also want the median). Once these values are calculated, we can use the empirical rule to determine the range in which a given percentage of the data is expected to fall.

R




# Calculate mean and standard deviation of data
data <- c(3, 5, 6, 7, 8, 11, 12, 14, 15, 16)
mean <- mean(data)
sd <- sd(data)


Calculate the Boundaries of the Three Intervals

The interval within one standard deviation of the mean runs from the mean minus the standard deviation to the mean plus the standard deviation. For example, if the mean is 10 and the standard deviation is 5, then about 68% of the data should be between 5 and 15. The interval within two standard deviations runs from the mean minus two times the standard deviation to the mean plus two times the standard deviation, so with the same mean and standard deviation about 95% of the data should be between 0 and 20. The three-standard-deviation interval is built the same way: about 99.7% of the data should be between -5 and 25.
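The worked example with a mean of 10 and a standard deviation of 5 can be reproduced in a few lines. This is a minimal sketch; the names m, s, and bounds are chosen for illustration and the expected percentages are the rounded values from the empirical rule.

```r
m <- 10   # mean from the worked example
s <- 5    # standard deviation from the worked example
k <- 1:3  # number of standard deviations

# Lower and upper boundary of each interval: mean -/+ k * sd
bounds <- data.frame(
  k = k,
  lower = m - k * s,
  upper = m + k * s,
  expected_percent = c(68, 95, 99.7)
)
print(bounds)
```

The lower bounds come out as 5, 0, and -5, and the upper bounds as 15, 20, and 25.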

R




# Apply empirical rule
# 68% of data within one SD from the mean
low <- mean - sd
high <- mean + sd
low_68_percent <- data[data > low & data < high]
  
# 95% of data within two SD from the mean
low <- mean - (2*sd)
high <- mean + (2*sd)
low_95_percent <- data[data > low & data < high]
  
# 99.7% of data within three SD from the mean
low <- mean - (3*sd)
high <- mean + (3*sd)
low_99_7_percent <- data[data > low & data < high]


Use the proportions to interpret the data according to the Empirical Rule: 

Approximately 68% of all data values should be within one standard deviation of the mean;
Approximately 95% of all data values should be within two standard deviations of the mean;
Approximately 99.7% of all data values should be within three standard deviations of the mean.

R




# Print results
cat('Data within 1 SD from the mean (68%):',
   low_68_percent, '\n')
cat('Data within 2 SD from the mean (95%):',
   low_95_percent, '\n')
cat('Data within 3 SD from the mean (99.7%):',
   low_99_7_percent)


Output:

Data within 1 SD from the mean (68%): 6 7 8 11 12 14 
Data within 2 SD from the mean (95%): 3 5 6 7 8 11 12 14 15 16 
Data within 3 SD from the mean (99.7%): 3 5 6 7 8 11 12 14 15 16
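The output above lists the values inside each interval. To compare them with the 68%, 95%, and 99.7% figures, it may help to turn the counts into proportions. The sketch below does this with the same sample data; the names m, s, and prop_within are chosen for this example.

```r
data <- c(3, 5, 6, 7, 8, 11, 12, 14, 15, 16)
m <- mean(data)
s <- sd(data)

# Proportion of observations strictly within k standard deviations of the mean
prop_within <- sapply(1:3, function(k) mean(abs(data - m) < k * s))

cat("Observed proportions within 1, 2, 3 SD:", prop_within, "\n")  # 0.6 1 1
cat("Empirical rule proportions:            ", c(0.68, 0.95, 0.997), "\n")
```

With only ten observations the observed proportions (60%, 100%, 100%) can only roughly match the rule. The agreement is usually closer for larger samples that are approximately normal.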

Plotting a Normal Distribution

One way to visualize the empirical rule is to plot a normal distribution, which shows how the data is distributed around the mean. You can use the ggplot2 library in R to create a histogram of the data and superimpose a density curve on top. The area under the curve within one, two, and three standard deviations of the mean represents the percentage of data that falls within those intervals, as described by the empirical rule.

R




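# Plot a histogram of simulated normal data with a normal curve on top
library(ggplot2)
  
set.seed(42)  # make the simulated sample reproducible
data <- rnorm(100, mean = 10, sd = 2)
ggplot(data.frame(data), aes(x=data)) + 
  # Scale the histogram to density so it is comparable with the curve
  geom_histogram(aes(y=after_stat(density)), binwidth=0.5, 
                 col="black", fill="white") + 
  # Density of the normal distribution the data was drawn from
  stat_function(fun=dnorm, args=list(mean=10, sd=2),
                col="red") + 
  labs(title="Normal Distribution",
       x="Data", y="Density")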


Output:

Plotting a normal distribution


Using the qnorm() Function

Another way to work with these percentages is to use the qnorm() function, which is the inverse of pnorm(). The qnorm() function takes as input a cumulative probability together with the mean and standard deviation of the data and returns the corresponding quantile, that is, the value below which that proportion of a normal distribution falls. For example, qnorm(0.95, mean = mean, sd = sd) returns the value below which 95% of the data is expected to fall. Note that this is a single cutoff, not an interval centered on the mean.

R




data <- c(1,2,3,4,5)
mean <- mean(data)
sd <- sd(data)
  
# Quantiles: values below which 68%, 95% and 99.7% of a normal
# distribution with this mean and sd fall
quantile_68 <- qnorm(0.68, mean = mean,
                     sd = sd)
quantile_95 <- qnorm(0.95, mean = mean,
                     sd = sd)
quantile_99_7 <- qnorm(0.997, mean = mean,
                       sd = sd)
  
cat("68% of the data falls below",
    quantile_68, "\n")
cat("95% of the data falls below",
    quantile_95, "\n")
cat("99.7% of the data falls below",
    quantile_99_7)


Output:

68% of the data falls below 3.739497 
95% of the data falls below 5.600742 
99.7% of the data falls below 7.344624
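A single qnorm() call returns one cutoff. To get an interval centered on the mean that contains a given percentage, one option is to cut equal tails off both sides and ask qnorm() for both quantiles. The sketch below assumes this approach; central_interval() is a hypothetical helper written for this example, not a built-in R function.

```r
data <- c(1, 2, 3, 4, 5)
m <- mean(data)
s <- sd(data)

# Hypothetical helper: central interval containing `level` of a normal
# distribution with mean m and standard deviation s
central_interval <- function(level, m, s) {
  tail <- (1 - level) / 2               # probability left in each tail
  qnorm(c(tail, 1 - tail), mean = m, sd = s)
}

int_68 <- central_interval(0.68, m, s)
int_95 <- central_interval(0.95, m, s)
int_99_7 <- central_interval(0.997, m, s)

cat("Central 68% interval:", int_68, "\n")
cat("Central 95% interval:", int_95, "\n")
cat("Central 99.7% interval:", int_99_7, "\n")
```

The intervals are symmetric about the mean and are close to, but not exactly, mean ± 1, 2, and 3 standard deviations, because 68%, 95%, and 99.7% are rounded figures.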

Using the ECDF (Empirical Cumulative Distribution Function)

A third way to apply the empirical rule is to use the ECDF, which gives, for each value, the proportion of observations less than or equal to it. You can use the ecdf() function together with ggplot2's geom_step() to draw an ECDF plot, and you can use the quantile() function to find the values that bound a given central percentage of the data.

R




# Plot the ECDF and compute central intervals from sample quantiles
library(ggplot2)
library(dplyr)
  
data <- c(1,2,3,4,5)
data_df <- data.frame(data)
# Evaluate the ECDF at each observation
ecdf_df <- data_df %>%
  mutate(cdf = ecdf(data)(data))
  
ggplot(ecdf_df, aes(x=data, y=cdf)) + 
  geom_step(col="red") + 
  labs(title="ECDF", x="Data", y="Cumulative Frequency")
  
# Central 68%, 95% and 99.7% intervals: cut equal tails off each side
interval_68 <- quantile(data, c(0.16, 0.84))
interval_95 <- quantile(data, c(0.025, 0.975))
interval_99_7 <- quantile(data, c(0.0015, 0.9985))
  
cat("68% of the data falls between", interval_68, "\n")
cat("95% of the data falls between", interval_95, "\n")
cat("99.7% of the data falls between", interval_99_7)


Output:

68% of the data falls between 1.64 4.36 
95% of the data falls between 1.1 4.9 
99.7% of the data falls between 1.006 4.994
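
The ECDF can also be used in the other direction: rather than asking which values bound a given percentage, evaluate it at mean ± k standard deviations to see what share of the sample lies in each interval. This is a minimal sketch using the same five values; F_hat and prop_within are names chosen for this example.

```r
data <- c(1, 2, 3, 4, 5)
m <- mean(data)
s <- sd(data)

# ecdf() returns a function giving the proportion of observations <= x
F_hat <- ecdf(data)

# Share of the sample in (mean - k*sd, mean + k*sd] for k = 1, 2, 3
prop_within <- sapply(1:3, function(k) F_hat(m + k * s) - F_hat(m - k * s))

cat("Proportion within 1, 2, 3 SD:", prop_within, "\n")  # 0.6 1 1
```

With only five observations these proportions are coarse, so they should be read as a rough check against 68%, 95%, and 99.7% rather than a precise test.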

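Finally, we can plot the standard normal distribution and shade the areas under the curve: red for within 1 standard deviation, blue for within 2 standard deviations, and green for within 3 standard deviations. Because the shaded regions are semi-transparent and overlap, the colors blend toward the center of the plot.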

R




library(ggplot2)
  
mean <- 0
sd <- 1
  
x <- seq(-4, 4, by = 0.01)
y <- dnorm(x, mean = mean, sd = sd)
  
df <- data.frame(x = x, y = y)
  
ggplot(df, aes(x = x, y = y)) +
  geom_area(data = df,
            aes(x = x,
                y = ifelse(x < mean + 1 * sd & x > mean - 1 * sd,
                           y, 0)),
            fill = "red", alpha = 0.5) +
  geom_area(data = df,
            aes(x = x,
                y = ifelse(x < mean + 2 * sd & x > mean - 2 * sd,
                           y, 0)),
            fill = "blue", alpha = 0.5) +
  geom_area(data = df,
            aes(x = x,
                y = ifelse(x < mean + 3 * sd & x > mean - 3 * sd,
                           y, 0)),
            fill = "green", alpha = 0.5) +
  geom_line(aes(x = x, y = y), size = 1) +
  xlab("x") +
  ylab("density") +
  ggtitle("Normal Distribution") +
  theme_classic()


Output:

Plot the normal distribution and shade the areas within 1, 2, and 3 standard deviations.


To apply the empirical rule in R, calculate the mean and standard deviation of the data set using the mean() and sd() functions. Then use the pnorm() function, which evaluates the normal cumulative distribution function, at the lower and upper bounds a given number of standard deviations from the mean. The difference between the two values is the probability that a normally distributed value falls within that range: about 68%, 95%, and 99.7% for one, two, and three standard deviations.



Last Updated : 16 Mar, 2023