How to Perform a Chi-Square Goodness of Fit Test in R

Last Updated : 01 Mar, 2024

The Chi-Square Goodness of Fit Test is a statistical test that checks whether the observed frequencies of a categorical variable match the frequencies expected under a hypothesised distribution. It is widely used in domains such as science, biology, and business. In this article, we will see how to perform the chi-square test in the R Programming Language.

What is the Chi-Square Goodness of Fit Test?

The chi-square goodness of fit test assesses whether the differences between the observed and expected frequencies are statistically significant. The null hypothesis is that the observed frequencies follow the expected distribution. The test statistic is calculated with the following formula.

[Tex]\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} [/Tex]

where,

  • χ² is the Chi-Square statistic
  • Oᵢ is the observed frequency for each category
  • Eᵢ is the expected frequency for each category
  • ∑ denotes the sum over all categories.
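
To make the formula concrete, here is a minimal sketch that applies it to a made-up coin-flip example. The counts and the variable names (observed_flips, expected_flips) are assumptions chosen for illustration and do not come from any dataset in this article.

```r
# Hypothetical data: 100 coin flips with 60 heads and 40 tails observed
observed_flips <- c(heads = 60, tails = 40)

# Under a fair coin we would expect 50 of each
expected_flips <- c(heads = 50, tails = 50)

# Contribution of each category: (O - E)^2 / E
contributions <- (observed_flips - expected_flips)^2 / expected_flips
print(contributions)

# The chi-square statistic is the sum of the contributions
chi_sq_flips <- sum(contributions)
print(chi_sq_flips)
```

Each category contributes (10)² / 50 = 2, so the statistic for this toy example is 4.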

Calculating the Chi-square goodness of fit test manually in R

Since we know the formula, we can calculate the chi-square statistic by hand. In this example, we will create a fictional dataset of transportation mode frequencies in two cities and treat each of the six city and mode combinations as one category.

R

# Create a fictional dataset
city <- c("City A", "City A", "City A", "City B", "City B", "City B")
transport_mode <- c("Car", "Public Transit", "Bicycle", "Car", "Public Transit",
                    "Bicycle")
observed <- c(40, 30, 20, 35, 25, 15)  # Observed frequencies
expected <- c(35, 30, 20, 40, 25, 15)  # Expected frequencies
 
# Calculate Chi-Square statistic manually
chi_sq_statistic <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- 1 - pchisq(chi_sq_statistic, df)
 
# Print results
print(paste("Chi-Square Statistic:", chi_sq_statistic))
print(paste("Degrees of Freedom:", df))
print(paste("P-value:", p_value))

Output:

[1] "Chi-Square Statistic: 1.33928571428571"

[1] "Degrees of Freedom: 5"

[1] "P-value: 0.930837766731732"

The chi-square statistic here is about 1.34, which measures the discrepancy between the observed and expected frequencies under the null hypothesis. The value is small relative to the degrees of freedom, so the observed counts differ little from the expected ones.

  • Degrees of Freedom: This is the number of values that are free to vary once the total is fixed. For a goodness of fit test with fully specified expected frequencies, it equals the number of categories minus 1. Here there are 6 categories, so df is 5.
  • P-value: The p-value of about 0.93 is far above the usual 0.05 threshold, so the observed frequencies are consistent with the expected frequencies and we fail to reject the null hypothesis.
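
As a cross-check, the same numbers can be reproduced with the built-in chisq.test() function from base R. The sketch below assumes the expected counts are converted to proportions for the p argument, which must sum to 1; the name gof_result is chosen here for illustration.

```r
# Same observed and expected counts as in the manual calculation
observed <- c(40, 30, 20, 35, 25, 15)
expected <- c(35, 30, 20, 40, 25, 15)

# chisq.test() takes the hypothesised proportions via `p`
gof_result <- chisq.test(x = observed, p = expected / sum(expected))

print(gof_result$statistic)  # chi-square statistic
print(gof_result$parameter)  # degrees of freedom
print(gof_result$p.value)    # p-value
```

The statistic, degrees of freedom, and p-value should agree with the manual results up to rounding.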

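We can also plot a graph to see the difference between the values. The plot needs the "dplyr" package (for the pipe and mutate()) and the "ggplot2" package (for the plotting functions), so both must be loaded.

R

# Install the packages once if needed, then load them
# install.packages(c("dplyr", "ggplot2"))
library(dplyr)    # provides %>% and mutate()
library(ggplot2)  # provides ggplot() and the geoms used below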
 
# Create a data frame for plotting
data_plot <- data.frame(Transportation_Mode = transport_mode,
                        Observed = observed,
                        Expected = expected)
 
# Calculate deviations between observed and expected frequencies
data_plot <- data_plot %>%
  mutate(deviation = Observed - Expected)
 
# Plot observed and expected frequencies with deviations
ggplot(data_plot, aes(x = Transportation_Mode, y = Observed, fill = "Observed")) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  geom_bar(aes(y = Expected, fill = "Expected"), stat = "identity", position = "dodge",
           width = 0.5, alpha = 0.5) +
  geom_errorbar(aes(ymin = pmin(Observed, Expected), ymax = pmax(Observed, Expected),
                    color = "Deviation"),
                width = 0.2, position = position_dodge(width = 0.5)) +
  labs(title = "Observed vs. Expected Frequencies of Transportation Modes",
       y = "Frequency",
       fill = "") +
  scale_fill_manual(values = c("Observed" = "blue", "Expected" = "green"),
                    name = "Category") +
  scale_color_manual(values = "red",
                     name = "Deviation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Output:

Figure: Chi-Square Goodness of Fit Test in R

Calculating Chi-square test on an airline dataset using chisq.test() function

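In this example, we will use the built-in chisq.test() function. When it is given a two-way contingency table, as here, chisq.test() performs a chi-square test of independence between the row and column variables rather than a one-sample goodness of fit test. For this example, we will use an external dataset from the Kaggle website.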

Dataset Link: Flight Price Prediction

Make sure you replace the path of the file with the actual path on your system.

R

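# Load the dataset (use forward slashes or doubled backslashes in the path;
# a single backslash starts an escape sequence in an R string)
data <- read.csv("path/to/your/file.csv")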
 
# Create a contingency table
cont_table <- table(data$airline, data$class)
 
# Perform Chi-Square test
chi_sq_result <- chisq.test(cont_table)
 
# Print the results
print(chi_sq_result)

Output:

Pearson's Chi-squared test

data: cont_table
X-squared = 60493, df = 5, p-value < 2.2e-16
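
The object returned by chisq.test() also stores the expected counts and residuals, which help show where the discrepancy comes from. Because the Kaggle file is not bundled with this article, the sketch below uses a small made-up table instead; toy_table, its counts, and the labels "Airline X" and "Airline Y" are hypothetical.

```r
# Hypothetical 2x2 table of airline by travel class (made-up counts)
toy_table <- matrix(c(30, 10,
                      20, 40),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(airline = c("Airline X", "Airline Y"),
                                    class = c("Business", "Economy")))

# Test of independence without the continuity correction
toy_result <- chisq.test(toy_table, correct = FALSE)

print(toy_result$expected)   # counts expected if airline and class were independent
print(toy_result$residuals)  # Pearson residuals: (observed - expected) / sqrt(expected)
print(toy_result$statistic)  # sum of the squared Pearson residuals
```

Cells with large absolute residuals contribute most to the statistic, so they are the first place to look when the test rejects the null hypothesis.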

We can also plot the observed and expected frequencies with the help of the ggplot2 library in R.

R

# Extract observed and expected frequencies as plain vectors (column-major order)
observed <- as.vector(cont_table)
expected <- as.vector(chi_sq_result$expected)
 
# Create a data frame for plotting: one row per airline, class and type,
# so that every frequency gets the correct labels
plot_data <- data.frame(
  Category = rep(rownames(cont_table), times = 2 * ncol(cont_table)),
  Class = rep(rep(colnames(cont_table), each = nrow(cont_table)), times = 2),
  Frequency = c(observed, expected),
  Type = rep(c("Observed", "Expected"), each = length(observed))
)
 
# Plot the frequencies
library(ggplot2)
 
ggplot(plot_data, aes(x = Category, y = Frequency, fill = Type)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  facet_wrap(~ Class) +  # one panel per travel class
  labs(title = "Observed vs. Expected Frequencies",
       y = "Frequency",
       fill = "Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Output:

Figure: Chi-Square Goodness of Fit Test in R

The very large test statistic and the tiny p-value indicate that the distribution of travel class differs strongly across airlines, and the graph shows correspondingly wide gaps between the expected and observed frequencies.

Calculating Chi-square test using vcd package

We can use the ‘vcd’ package available in R to calculate the chi-square test along with measures of association that help us understand the dataset better. Here, we create a fictional dataset of smartphone brands used by people in different age groups. Because both variables are sampled independently at random, we expect the test to find no significant association.

R

# Load necessary packages
library(vcd)
 
# Set the seed for reproducibility
set.seed(123)
 
# Define age groups and smartphone brands
age_groups <- c("Teenager", "Adult", "Senior")
smartphone_brands <- c("Samsung", "Apple", "Xiaomi", "Huawei", "Google")
 
# Generate a fictional dataset
n <- 1000  # Number of observations
age_sample <- sample(age_groups, n, replace = TRUE)
smartphone_sample <- sample(smartphone_brands, n, replace = TRUE)
 
# Convert to factors
age_sample <- factor(age_sample, levels = age_groups)
smartphone_sample <- factor(smartphone_sample, levels = smartphone_brands)
 
# Create a contingency table
cont_table <- table(age_sample, smartphone_sample)
 
# Perform Chi-Square test using assocstats()
chi_sq_result <- assocstats(cont_table)
 
# Print the result
print(chi_sq_result)

Output:

                    X^2 df P(> X^2)
Likelihood Ratio 10.856  8  0.20998
Pearson          10.961  8  0.20394

Phi-Coefficient   : NA
Contingency Coeff.: 0.104
Cramer's V        : 0.074
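
The association measures in this output follow directly from the Pearson statistic. The sketch below recomputes them from the printed values using the standard formulas for Cramér's V and the contingency coefficient; the variable names are chosen here for illustration, and the result is only as precise as the rounded statistic in the output.

```r
# Values read from the assocstats() output above
pearson_x2 <- 10.961  # Pearson chi-square statistic
n_obs <- 1000         # number of observations
n_rows <- 3           # age groups
n_cols <- 5           # smartphone brands

# Cramer's V: sqrt(X^2 / (n * min(rows - 1, cols - 1)))
cramers_v <- sqrt(pearson_x2 / (n_obs * min(n_rows - 1, n_cols - 1)))

# Contingency coefficient: sqrt(X^2 / (X^2 + n))
contingency_coeff <- sqrt(pearson_x2 / (pearson_x2 + n_obs))

print(round(cramers_v, 3))
print(round(contingency_coeff, 3))
```

Both values are close to 0, which matches the non-significant p-value of about 0.20. The phi coefficient is reported as NA because it is only defined for 2×2 tables.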

Calculating Chi-Square test using the prop.test() function

In this example, we will create a fictional dataset from a drug trial and use the prop.test() function to compare the success proportions of the two treatments.

R

# Treatment outcomes data
success_new_drug <- 45
failure_new_drug <- 15
success_standard_drug <- 30
failure_standard_drug <- 30
 
# Create the 2x2 contingency table
cont_table_2x2 <- matrix(c(success_new_drug, failure_new_drug, success_standard_drug,
                           failure_standard_drug), nrow = 2, byrow = TRUE)
 
# Perform Chi-Square test using prop.test() for proportions
chi_sq_result_2x2 <- prop.test(cont_table_2x2)
 
# Print the result
print(chi_sq_result_2x2)

Output:

2-sample test for equality of proportions with continuity correction

data: cont_table_2x2
X-squared = 6.9689, df = 1, p-value = 0.008294
alternative hypothesis: two.sided
95 percent confidence interval:
0.06596955 0.43403045
sample estimates:
prop 1 prop 2
0.75 0.50

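We created a 2×2 contingency table where the rows represent the two groups (new drug treatment vs. standard drug treatment) and the columns represent the treatment outcomes (success or failure).
We used the prop.test() function to perform a Chi-Square test for equality of proportions on this 2×2 table. The p-value of about 0.008 is below 0.05, so we reject the null hypothesis that the two treatments have the same success rate (75% for the new drug vs. 50% for the standard drug).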
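
For a 2×2 table, prop.test() and chisq.test() compute the same chi-square statistic, since both apply a continuity correction by default. The sketch below checks this on the same counts; the dimnames labels and the result names prop_result and chisq_result are added here for readability and are not part of the original example.

```r
# Same 2x2 table as above: rows are treatment groups, columns are outcomes
cont_table_2x2 <- matrix(c(45, 15,
                           30, 30),
                         nrow = 2, byrow = TRUE,
                         dimnames = list(group = c("New drug", "Standard drug"),
                                         outcome = c("Success", "Failure")))

prop_result <- prop.test(cont_table_2x2)    # continuity correction on by default
chisq_result <- chisq.test(cont_table_2x2)  # Yates' correction on by default for 2x2

print(unname(prop_result$statistic))
print(unname(chisq_result$statistic))
```

The practical difference is in the extra output: prop.test() also reports the estimated proportions and a confidence interval for their difference.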


