Shapiro–Wilk Test in R Programming

The Shapiro-Wilk test (or Shapiro test) is a normality test in frequentist statistics. Its null hypothesis is that the sample was drawn from a normally distributed population. It is among three general normality tests designed to detect all kinds of departure from normality. If the p-value is less than or equal to the chosen significance level (conventionally 0.05), the test rejects the hypothesis of normality: the data show a statistically significant departure from normality at the 5% level. If the p-value is greater than 0.05, the test finds no significant departure from normality, which is not the same as proving that the data are normal. The test is easy to run in R.
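The decision rule above can be sketched on simulated data, using R's shapiro.test() function, which is described later in this article. The sample sizes, the seed and the variable names are assumptions chosen only for illustration.

```r
# Illustrative sketch: apply the p <= 0.05 rule to two simulated samples.
# The seed, sample sizes and distribution parameters are arbitrary choices.
set.seed(42)

skewed      <- rexp(200, rate = 1)            # strongly right-skewed data
normal_like <- rnorm(200, mean = 10, sd = 2)  # data drawn from a normal

alpha <- 0.05
p_skewed <- shapiro.test(skewed)$p.value
p_normal <- shapiro.test(normal_like)$p.value

# TRUE means the hypothesis of normality is rejected at the 5% level
cat("skewed sample: reject normality?", p_skewed <= alpha, "\n")
cat("normal sample: reject normality?", p_normal <= alpha, "\n")
```

The exponential sample should be rejected decisively. For the sample drawn from a normal distribution the test will usually, but not always, fail to reject, because a true null hypothesis is still rejected about 5% of the time at this level.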

Shapiro-Wilk’s Test Formula

Suppose x1, x2, …, xn is a sample of size n. To test the null hypothesis that it came from a normally distributed population, the Shapiro-Wilk test uses the statistic

W=\frac{\left(\sum_{i=1}^n a_i x_{(i)}\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}

where,

  • x(i) : the ith smallest value in the sample (the ith order statistic).
  • mean(x) : (x1 + x2 + … + xn) / n, i.e. the sample mean, written \bar{x} in the formula.
  • ai : coefficients calculated as (a1, a2, …, an) = (mT V-1) / C. Here m = (m1, m2, …, mn) is the vector of expected values of the order statistics of n independent standard normal random variables, V is the covariance matrix of those order statistics, and C = || V-1 m || is a vector norm.
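To make the structure of the formula concrete, here is a minimal sketch that computes a W-type statistic by hand. It is not the exact Shapiro-Wilk calculation: as a simplifying assumption, V is replaced by the identity matrix (which gives the related Shapiro-Francia statistic) and the expected order statistics m are approximated by Blom scores. The variable names are illustrative only.

```r
# Sketch: a W-type statistic computed by hand on a built-in data set.
# ASSUMPTION: V is replaced by the identity matrix (Shapiro-Francia)
# and m is approximated by Blom scores, so this only approximates
# the exact Shapiro-Wilk coefficients a_i.
x <- ToothGrowth$len
n <- length(x)

# Approximate expected values of the standard normal order statistics
m <- qnorm((seq_len(n) - 3 / 8) / (n + 1 / 4))

# With V = I the coefficients reduce to a = m / ||m||
a <- m / sqrt(sum(m^2))

x_sorted    <- sort(x)                 # order statistics x_(1) <= ... <= x_(n)
numerator   <- sum(a * x_sorted)^2     # (sum of a_i * x_(i))^2
denominator <- sum((x - mean(x))^2)    # sum of squared deviations from the mean
w_approx    <- numerator / denominator

# Compare with the statistic reported by R's built-in test
w_builtin <- unname(shapiro.test(x)$statistic)
print(c(approximation = w_approx, shapiro.test = w_builtin))
```

The two values should be close but not identical, since shapiro.test() uses its own approximation to the true coefficients rather than the identity-matrix shortcut taken here.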

Implementation in R

To perform the Shapiro-Wilk test, R provides the shapiro.test() function.



Syntax:

shapiro.test(x)

Parameter:

x : a numeric vector containing the data values. Missing values are allowed, but the number of non-missing values must be between 3 and 5000.
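The following sketch illustrates how the argument behaves with missing values and with too few observations. The data values are made up for the example.

```r
# Sketch: how shapiro.test() treats NA values and very small samples.
# The numbers below are invented purely for illustration.
x <- c(2.1, 3.4, NA, 1.9, 4.2, 3.3, NA, 2.8)

with_na    <- shapiro.test(x)             # NA values are dropped internally
without_na <- shapiro.test(x[!is.na(x)])  # same test on the cleaned vector

# Both calls use the same 6 non-missing values
print(c(with_na$statistic, without_na$statistic))

# Fewer than 3 non-missing values is an error; capture the message
too_small <- tryCatch(
  shapiro.test(c(1.5, 2.5)),
  error = function(e) conditionMessage(e)
)
print(too_small)
```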

Let us see how to perform the Shapiro Wilk’s test step by step.

  • Step 1: First install the required package. The only add-on package used here is dplyr, which provides the sample_n() function used in Step 4 to preview the data; shapiro.test() itself belongs to the stats package that ships with R. The package can be installed from the R console in the following way:
install.packages("dplyr")
  • Step 2: Now load the installed package into the R script. This is done with the library() function in the following way.

# loading the package
library(dplyr)

  • Step 3: The most important task is to select a proper data set. Here let’s work with the ToothGrowth data set, which is built into R (it ships with the datasets package).

# loading the package
library("dplyr")
  
# Using the ToothGrowth data set
# loading the data set
my_data <- ToothGrowth

One can also use their own data set. To do so, first prepare the data, save it to a file, and then import the file into the script. The file can be imported using the following syntax:

# if the file is a tab-delimited .txt file
data <- read.delim(file.choose())
# if the file is a .csv file
data <- read.csv(file.choose())
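Because file.choose() opens an interactive file dialog, a script that must run unattended can pass a file path instead. The sketch below is one way to do that; it writes ToothGrowth to a temporary CSV file only so that the example is self-contained, and the path is a stand-in for your own file.

```r
# Sketch: import a CSV non-interactively and test one numeric column.
# ASSUMPTION: the temporary file stands in for a user-supplied data file.
csv_path <- tempfile(fileext = ".csv")
write.csv(ToothGrowth, csv_path, row.names = FALSE)

# Read the file back by path instead of using file.choose()
data <- read.csv(csv_path)
str(data)

# Run the test on the numeric column of interest
shapiro.test(data$len)
```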
  • Step 4: Now fix the seed of the random number generator using the set.seed() function, so that the random sample is reproducible. Then display a sample of 10 randomly chosen rows using the sample_n() function of the dplyr package. This is a quick way to check the data.

# loading the package
library("dplyr")
  
# Using the ToothGrowth data set
# loading the data set
my_data <- ToothGrowth
  
# Using set.seed() to make the
# random sample reproducible
set.seed(1234)
  
# Using the sample_n() for 
# random sample of 10 rows
dplyr::sample_n(my_data, 10)

Output:

   len supp dose
1  11.2   VC  0.5
2   8.2   OJ  0.5
3  10.0   OJ  0.5
4  27.3   OJ  2.0
5  14.5   OJ  1.0
6  26.4   OJ  2.0
7   4.2   VC  0.5
8  15.2   VC  1.0
9  14.5   OJ  0.5
10 26.7   VC  2.0
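The role of set.seed() can be checked with a small base R sketch that does not need dplyr. It draws 10 random rows twice with the same seed and confirms that both draws match. The rows selected this way are not guaranteed to be the same rows that sample_n() returns, and the exact rows can differ between R versions. Recent dplyr releases document sample_n() as superseded by slice_sample(), so the latter may be preferable in new code.

```r
# Sketch: the same seed gives the same random sample (base R only).
my_data <- ToothGrowth

set.seed(1234)
first_draw <- my_data[sample(nrow(my_data), 10), ]

# Resetting the seed reproduces exactly the same rows
set.seed(1234)
second_draw <- my_data[sample(nrow(my_data), 10), ]

print(identical(first_draw, second_draw))
print(first_draw)
```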
  • Step 5: Finally, perform the Shapiro-Wilk test using the shapiro.test() function.

# loading the package
library("dplyr")
  
# Using the ToothGrowth data set
# loading the data set
my_data <- ToothGrowth
  
# Using set.seed() to make the
# random sample reproducible
set.seed(1234)
  
# Using the sample_n() 
# for random sample of 10 rows
dplyr::sample_n(my_data, 10)
  
# Using shapiro.test() to check
# the len column for normality
shapiro.test(my_data$len)

Output:

> dplyr::sample_n(my_data, 10)
    len supp dose
1  11.2   VC  0.5
2   8.2   OJ  0.5
3  10.0   OJ  0.5
4  27.3   OJ  2.0
5  14.5   OJ  1.0
6  26.4   OJ  2.0
7   4.2   VC  0.5
8  15.2   VC  1.0
9  14.5   OJ  0.5
10 26.7   VC  2.0
> shapiro.test(my_data$len)

    Shapiro-Wilk normality test

data:  my_data$len
W = 0.96743, p-value = 0.1091

The p-value (0.1091) is greater than 0.05, so we fail to reject the null hypothesis of normality. In other words, the distribution of len is not significantly different from a normal distribution, and we can proceed under the assumption of normality.
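The same conclusion can be reached in code rather than by reading the printed output. This is a minimal sketch; the variable names and the 0.05 threshold stored in alpha are illustrative choices.

```r
# Sketch: extract W and the p-value and apply the decision rule in code.
result <- shapiro.test(ToothGrowth$len)

alpha  <- 0.05                       # chosen significance level
w_stat <- unname(result$statistic)   # the W statistic
p_val  <- result$p.value             # the p-value

if (p_val <= alpha) {
  message("Reject normality: significant departure from a normal distribution.")
} else {
  message("Fail to reject normality: no significant departure detected.")
}
```

The object returned by shapiro.test() is a list of class "htest", so its components can be reused in later steps of an analysis, for example to decide between a parametric and a non-parametric method.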



