Chi-Square Test in R
The chi-square test of independence evaluates whether there is an association between the categories of two variables. Random variables yield two broad types of data: numerical and categorical. The chi-square statistic is used to investigate whether the distributions of categorical variables differ from one another. The test is also useful for comparing the tallies or counts of categorical responses between two (or more) independent groups.
In R, the function used for performing a chi-square test is chisq.test().
Syntax:
chisq.test(data)
Parameters:
data: a table (or matrix) containing the counts for each combination of categories.
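As a minimal sketch of the call shape, the block below runs chisq.test() on a small matrix of counts. The counts and the row and column labels are made up for illustration and do not come from any data set. Setting correct = FALSE switches off the continuity correction that chisq.test() applies by default to 2 x 2 tables.

```r
# Hypothetical 2 x 2 table of counts (made-up numbers, for illustration only)
counts <- matrix(c(30, 20,
                   20, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 answer = c("yes", "no")))

# correct = FALSE disables the continuity correction used by default for 2 x 2 tables
res <- chisq.test(counts, correct = FALSE)

# The result is an "htest" object whose components can be read individually
res$statistic   # the X-squared value
res$parameter   # degrees of freedom
res$p.value     # p-value
res$expected    # expected counts under independence
```

Reading the components one by one is handy when the test result feeds into later code rather than being printed.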
Example
We will use the survey data set in the MASS library, which contains the responses from a survey conducted on students.
# load the MASS package
library(MASS)
# display the structure of the survey data
str(survey)
Output:
'data.frame': 237 obs. of 12 variables:
 $ Sex   : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
 $ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
 $ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
 $ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
 $ Fold  : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
 $ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
 $ Clap  : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
 $ Exer  : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
 $ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
 $ Height: num 173 178 NA 160 165 ...
 $ M.I   : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
 $ Age   : num 18.2 17.6 16.9 20.3 23.7 ...
The result above shows that the data set has many Factor variables, which can be treated as categorical variables. For this example, we will consider the variables "Exer" and "Smoke". The Smoke column records the students' smoking habits, while the Exer column records their exercise level. Our aim is to test the hypothesis that the students' smoking habit is independent of their exercise level at the .05 significance level.
# Create a contingency table of smoking habit by exercise level.
stu_data <- table(survey$Smoke, survey$Exer)
print(stu_data)
Output:
        Freq None Some
  Heavy    7    1    3
  Never   87   18   84
  Occas   12    3    4
  Regul    9    1    7
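The counts in this table add up to 236 rather than 237 because table() by default leaves out any row with a missing value in either variable. A short check, assuming the MASS package is installed:

```r
library(MASS)

stu_data <- table(survey$Smoke, survey$Exer)

nrow(survey)     # number of students in the survey
sum(stu_data)    # number of students counted in the table

# Students with a missing value in either column are excluded by table()
sum(is.na(survey$Smoke) | is.na(survey$Exer))

# Row and column totals alongside the cell counts
addmargins(stu_data)
```

The row and column totals printed by addmargins() are the quantities from which the expected counts of the test are derived.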
Finally, we apply the chisq.test() function to the contingency table stu_data.
# applying chisq.test() function
print(chisq.test(stu_data))
Output:
        Pearson's Chi-squared test

data:  stu_data
X-squared = 5.4885, df = 6, p-value = 0.4828
Since the p-value of 0.4828 is greater than the .05 significance level, we fail to reject the null hypothesis that smoking habit is independent of exercise level. The data provide no significant evidence of an association between the two variables.
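To see where the statistic, degrees of freedom, and p-value reported by chisq.test() come from, they can be rebuilt by hand. Under independence, the expected count of each cell is (row total * column total) / grand total, and the statistic sums (observed - expected)^2 / expected over all cells. The sketch below assumes the MASS package is installed; the variable names are chosen here for illustration.

```r
library(MASS)

observed <- table(survey$Smoke, survey$Exer)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic, degrees of freedom, and upper-tail p-value
x_squared <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

c(x_squared = x_squared, df = df, p_value = p_value)

# Number of cells with an expected count below 5
sum(expected < 5)

# One alternative: a Monte Carlo p-value, which does not rely on the
# chi-square approximation
set.seed(1)  # arbitrary seed so the simulated p-value is repeatable
chisq.test(observed, simulate.p.value = TRUE, B = 10000)
```

Some cells in this table have expected counts below 5, which is why R warns that the chi-squared approximation may be incorrect when chisq.test() is run on it. A simulated p-value, as in the last call, is one way to cross-check the conclusion in that situation.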
The complete R code is given below.
# R program to illustrate
# Chi-Square Test in R
library(MASS)
# display the structure of the survey data
str(survey)
# contingency table of smoking habit by exercise level
stu_data <- table(survey$Smoke, survey$Exer)
print(stu_data)
# chi-square test of independence
print(chisq.test(stu_data))
In summary, performing a chi-square test of independence in R takes a single call to the chisq.test() function on a contingency table of counts.