Chi-Square Test in R

Last Updated : 19 Dec, 2023

The chi-square test of independence evaluates whether two categorical variables are associated. Random variables yield two broad types of data, numerical and categorical, and the chi-square test applies to the categorical kind. It compares the counts observed in each combination of categories with the counts that would be expected if the variables were independent, which makes it useful for comparing the tallies of categorical responses between two (or more) independent groups.

In R Programming Language, the function used for performing a chi-square test is chisq.test().

Syntax:

chisq.test(data)

Parameters:

data: a contingency table (a table or matrix) containing the counts for each combination of categories of the two variables.
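As a minimal sketch of the call described above, the example below builds a small table by hand and passes it to chisq.test(). The counts and the labels Group and Answer are made up for illustration and do not come from any dataset. Note that for a 2 x 2 table chisq.test() applies Yates' continuity correction by default.

```r
# Hypothetical 2 x 2 table of counts (illustrative numbers only)
counts <- matrix(c(20, 30,
                   25, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Group = c("A", "B"),
                                 Answer = c("Yes", "No")))

# chisq.test() accepts a matrix or table of counts
res <- chisq.test(counts)

# Components of the returned "htest" object
res$statistic   # the X-squared value
res$parameter   # degrees of freedom
res$p.value     # p-value
res$expected    # expected counts under independence
```

Storing the result in a variable, rather than only printing it, makes the individual components available for later checks.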

We will use the survey dataset from the MASS package, which contains the responses from a survey conducted on students.

R




# load the MASS package
library(MASS)       
# str() prints the structure itself, so wrapping it in print() is not needed
str(survey)


Output:

'data.frame':    237 obs. of  12 variables:
$ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2 ...
$ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
$ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
$ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2 ...
$ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1 3 3 3 ...
$ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
$ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3 3 3 ...
$ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3 ...
$ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2 2 ...
$ Height: num 173 178 NA 160 165 ...
$ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2 2 2 ...
$ Age : num 18.2 17.6 16.9 20.3 23.7 ...

The output shows that the dataset has several Factor variables, which can be treated as categorical variables. For our test, we will use the variables “Exer” and “Smoke”. The Smoke column records the students' smoking habits, while the Exer column records their exercise level. Our aim is to test, at the .05 significance level, the null hypothesis that the students' smoking habit is independent of their exercise level.

R




# Create a contingency table of smoking habit (rows) by exercise level (columns).
stu_data <- table(survey$Smoke, survey$Exer)

print(stu_data)


Output:

        Freq None Some
  Heavy    7    1    3
  Never   87   18   84
  Occas   12    3    4
  Regul    9    1    7
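Before running the test, it can help to look at the totals of the table. The sketch below adds row and column totals with addmargins() and checks that the students missing from the table are exactly those with an NA in either column, since table() leaves out missing values by default. The names n_tabulated and n_missing are introduced here for illustration.

```r
library(MASS)

stu_data <- table(survey$Smoke, survey$Exer)

# Row, column, and grand totals
addmargins(stu_data)

# Students with a missing value in either column are left out of the table
n_tabulated <- sum(stu_data)
n_missing <- sum(is.na(survey$Smoke) | is.na(survey$Exer))

# Tabulated plus missing students should account for every row of the data
n_tabulated + n_missing == nrow(survey)
```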

And finally we apply the chisq.test() function to the contingency table stu_data.

R




# applying chisq.test() function
print(chisq.test(stu_data))


Output:

       Pearson's Chi-squared test

data: stu_data
X-squared = 5.4885, df = 6, p-value = 0.4828
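To see where these numbers come from, the statistic can be recomputed by hand. The sketch below uses the standard Pearson formula: the expected count of a cell is its row total times its column total divided by the grand total, and the statistic sums (observed - expected)^2 / expected over all cells. The variable names are chosen here for illustration.

```r
library(MASS)

stu_data <- table(survey$Smoke, survey$Exer)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(stu_data), colSums(stu_data)) / sum(stu_data)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((stu_data - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(stu_data) - 1) * (ncol(stu_data) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

c(X.squared = x_squared, df = df, p.value = p_value)
```

The three values should agree with the statistic, df, and p-value reported by chisq.test() for the same table.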

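As the p-value 0.4828 is greater than .05, we fail to reject the null hypothesis: the data provide no evidence that smoking habit and exercise level are associated. This is a lack of evidence against independence, not proof of it. R also warns here that the chi-squared approximation may be incorrect, because some cells of the table have small expected counts, so the p-value should be read with caution. The complete R code is given below.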
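Because several smoking categories contain few students, it is worth checking the expected counts behind the test. The sketch below inspects them and, as one possible alternative, asks chisq.test() for a Monte Carlo p-value through its simulate.p.value argument. The seed and the number of replicates B are arbitrary choices made here, and the threshold of 5 is a common rule of thumb rather than a strict limit.

```r
library(MASS)

stu_data <- table(survey$Smoke, survey$Exer)

# Suppress the approximation warning while storing the result
chi_result <- suppressWarnings(chisq.test(stu_data))

# Expected counts; values below about 5 make the approximation doubtful
round(chi_result$expected, 2)
sum(chi_result$expected < 5)

# Monte Carlo p-value, which does not rely on the chi-square approximation
set.seed(1)  # arbitrary seed so the simulation is reproducible
sim_result <- chisq.test(stu_data, simulate.p.value = TRUE, B = 10000)
sim_result
```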

In summary, performing a chi-square test of independence in R takes a single call to the chisq.test() function on a contingency table.

Visualize the Chi-Square Test data

R




# Load required library
library(MASS)
 
# Show the structure of the survey dataset
str(survey)
 
# Create a contingency table of smoking habit by exercise level
stu_data <- table(survey$Smoke, survey$Exer)
 
# Print the table
print(stu_data)
 
# Perform the Chi-Square Test
chi_result <- chisq.test(stu_data)
print(chi_result)
 
# One color per smoking category (row of the table)
bar_cols <- c("lightblue", "lightgreen", "orange", "tomato")
 
# Visualize the data with a grouped bar plot
barplot(stu_data, beside = TRUE, col = bar_cols,
        main = "Smoking Habits vs Exercise Levels",
        xlab = "Exercise Level", ylab = "Number of Students")
 
# Add the legend in the center of the plot, above the short "None" bars
legend("center", legend = rownames(stu_data), fill = bar_cols)


Output:

Figure: Chi-Square Test in R (grouped bar plot of the number of students by exercise level and smoking habit)

This code uses the survey dataset from the MASS package to run a chi-square test on the relationship between smoking habits and exercise levels.

It creates a contingency table, performs the test, and visualizes the counts with a grouped bar plot. The legend is added separately in the center of the plot and labels the smoking categories.

The plot complements the test by showing how the counts are distributed across the smoking and exercise categories.
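Since most students fall in one smoking category, raw counts can hide differences between exercise levels. As an optional sketch, the code below plots the share of each smoking category within each exercise level using prop.table(). The colors, the name smoke_share, and the widened x-axis that makes room for the legend are choices made here for illustration.

```r
library(MASS)

stu_data <- table(survey$Smoke, survey$Exer)

# Share of each smoking category within each exercise level (columns sum to 1)
smoke_share <- prop.table(stu_data, margin = 2)
round(smoke_share, 2)

# Stacked bars on a common 0-1 scale; one color per smoking category
cols <- c("lightblue", "lightgreen", "orange", "tomato")
barplot(smoke_share, col = cols,
        xlim = c(0, 5.5),  # extra space on the right for the legend
        main = "Smoking Habits within Each Exercise Level",
        xlab = "Exercise Level", ylab = "Proportion of Students",
        legend.text = rownames(smoke_share),
        args.legend = list(x = "topright", bty = "n"))
```

If smoking habit and exercise level were independent, the three bars would be expected to show similar proportions.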


