
How to Conduct an Anderson-Darling Test in R

Last Updated : 20 Mar, 2024

Statistical tests are indispensable tools in data analysis, helping researchers determine whether observed data fit a particular distribution. Among these tests, the Anderson-Darling test stands out for its sensitivity to differences in the tails of distributions, which makes it particularly useful for assessing goodness-of-fit. Here we look at the Anderson-Darling test, its significance, and how to conduct it in the R Programming Language.

Anderson-Darling Test

The Anderson-Darling test is a statistical test used to assess whether a given sample of data comes from a specific distribution, most commonly the normal distribution. Compared with some other goodness-of-fit tests, such as the Kolmogorov-Smirnov test, it gives more weight to the tails of the distribution, which makes it useful for detecting deviations in extreme values. The test statistic, denoted A² (printed as A in R output), is a weighted measure of the squared differences between the empirical cumulative distribution function (CDF) of the sample and the CDF expected under the null hypothesis.
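To make the definition of the statistic concrete, here is a minimal sketch that computes A² by hand for a normality test using only base R. The simulated sample and the variable names are assumptions made for illustration, and the mean and standard deviation are estimated from the sample, as is usual when testing normality.

```r
# Manual computation of the Anderson-Darling statistic for normality
set.seed(42)
x <- sort(rnorm(50, mean = 10, sd = 2))  # hypothetical simulated sample, sorted
n <- length(x)

# Fitted normal CDF evaluated at the ordered sample values
p <- pnorm((x - mean(x)) / sd(x))

# A^2 = -n - (1/n) * sum((2i - 1) * [log(p_i) + log(1 - p_(n + 1 - i))])
i <- seq_len(n)
a_squared <- -n - mean((2 * i - 1) * (log(p) + log(1 - rev(p))))
print(a_squared)
```

The log terms grow quickly when an observation sits far in either tail of the fitted distribution, which is where the tail sensitivity of the test comes from.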

Syntax

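ad_test <- ad.test(data)

Here, ad.test() comes from the nortest package, and data is a numeric vector containing more than 7 values; missing values are allowed.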
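As a quick illustration of the syntax, the sketch below runs ad.test() on simulated data and pulls out the parts of the result that are usually of interest. It assumes the nortest package is installed; the sample and variable names are made up for the example.

```r
library(nortest)

set.seed(1)
sample_data <- rnorm(100)        # hypothetical sample drawn from a normal distribution

result <- ad.test(sample_data)   # returns an object of class "htest"

result$statistic                 # the test statistic, labelled A
result$p.value                   # the p-value of the test
result$method                    # description of the test that was run
```

Because the result is a standard "htest" object, the same fields are available as for other tests in R such as shapiro.test().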

Anderson-Darling Test in R

Now we will conduct an Anderson-Darling test in R. We will use the Plane Price dataset, which can be downloaded from the link below.

Dataset Link – Plane Price

Performing Anderson-Darling Test on Plane Price

R
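# Install nortest once if it is not already installed
# install.packages('nortest')

# Load necessary libraries
library(nortest)    # provides ad.test()
library(ggplot2)    # plotting
library(gridExtra)  # provides grid.arrange()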

# Read the dataset
plane_data <- read.csv("Your/path")

# Choose the variable for analysis (e.g., Price)
price_variable <- plane_data$Price

# Perform Anderson-Darling test
ad_test <- ad.test(price_variable)

# Display test results
print(ad_test)

Output:


Anderson-Darling normality test

data: price_variable
A = 16.756, p-value < 2.2e-16

The “Price” variable is chosen for analysis, and the Anderson-Darling test is performed on it using the ad.test() function.

  • The printed results show a test statistic (A) of 16.756 and a p-value below 2.2e-16, which is strong evidence against the null hypothesis that the prices are normally distributed.
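The interpretation step above can be sketched as a small helper that compares the p-value with a chosen significance level. The function name interpret_ad and the default level of 0.05 are assumptions for illustration, not part of nortest, and the data are simulated.

```r
library(nortest)

# Hypothetical helper: turn an ad.test() result into a decision at level alpha
interpret_ad <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]            # drop missing values explicitly
  stopifnot(length(x) > 7)     # nortest::ad.test() needs more than 7 values
  result <- ad.test(x)
  list(
    statistic = unname(result$statistic),
    p_value = result$p.value,
    reject_normality = result$p.value < alpha
  )
}

set.seed(2024)
skewed <- rexp(500)            # simulated right-skewed data
str(interpret_ad(skewed))
```

Rejecting normality only says that the data are unlikely to come from a normal distribution; it does not say which distribution fits better.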

Visualizing the Distribution of Price Variable

R
# Plot histogram
histogram <- ggplot(plane_data, aes(x = Price)) +
             geom_histogram(binwidth = 10000, fill = "skyblue", color = "black", 
                            alpha = 0.7) +
             labs(title = "Histogram of Price Variable", x = "Price", y = "Frequency") +
             theme_minimal()

# Plot QQ plot
qq_plot <- ggplot(plane_data, aes(sample = Price)) +
           geom_qq() +
           geom_abline(intercept = mean(price_variable), slope = sd(price_variable), 
                       color = "red") +
           labs(title = "QQ Plot of Price Variable", x = "Theoretical Quantiles", 
                y = "Sample Quantiles") +
           theme_minimal()

# Arrange plots side by side
grid.arrange(histogram, qq_plot, nrow = 1)

Output:

Figure: Anderson-Darling Test in R (histogram and QQ plot of the Price variable)

The histogram displays the distribution of the “Price” variable with the x-axis representing price values and the y-axis representing frequency.

  • The QQ plot compares the quantiles of the “Price” variable against the quantiles of a theoretical normal distribution. Points on the QQ plot should fall approximately along a diagonal line if the data follows a normal distribution.
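For a quick check without ggplot2, the same kind of QQ plot can be drawn with base R. This is a minimal sketch on simulated right-skewed data standing in for a price-like variable; the values are hypothetical and are not taken from the Plane Price dataset.

```r
set.seed(123)
skewed_sample <- rexp(200, rate = 1 / 50000)  # hypothetical right-skewed values

# Sample quantiles against theoretical normal quantiles
qqnorm(skewed_sample, main = "Normal QQ Plot (simulated skewed data)")

# Reference line through the first and third quartiles
qqline(skewed_sample, col = "red")
```

For right-skewed data like this, the points bend away from the reference line in the upper tail instead of following it.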

Performing Anderson-Darling Test on Weather History

Dataset Link :- Weather History

R
# Load necessary libraries
library(nortest)    # provides ad.test()
library(ggplot2)
library(gridExtra)

# Read the dataset
weather_data <- read.csv("C:/Users/Tonmoy/Downloads/Dataset/weatherHistory.csv")

# Choose the variable for analysis (e.g., Temperature)
temperature_variable <- weather_data$Temperature..C.

# Perform Anderson-Darling test
ad_test <- ad.test(temperature_variable)

# Display test results
print(ad_test)

Output:

        Anderson-Darling normality test

data: temperature_variable
A = 202.36, p-value < 2.2e-16

The Anderson-Darling normality test is performed on the “Temperature” variable from the weather history dataset.

  • The test results indicate a test statistic (A) of 202.36 and a p-value less than 2.2e-16.
  • The small p-value suggests strong evidence against the null hypothesis of normality, indicating that the temperature data significantly deviates from a normal distribution.

Visualizing the Distribution

R
# Plot histogram
histogram <- ggplot(weather_data, aes(x = Temperature..C.)) +
             geom_histogram(binwidth = 1, fill = "skyblue", color = "black", 
                            alpha = 0.7) +
             labs(title = "Histogram of Temperature", x = "Temperature (C)",
                  y = "Frequency") +
             theme_minimal()

# Plot QQ plot
qq_plot <- ggplot(weather_data, aes(sample = Temperature..C.)) +
           geom_qq() +
           geom_abline(intercept = mean(temperature_variable), 
                       slope = sd(temperature_variable), color = "red") +
           labs(title = "QQ Plot of Temperature", x = "Theoretical Quantiles",
                y = "Sample Quantiles") +
           theme_minimal()
# Arrange plots side by side
grid.arrange(histogram, qq_plot, nrow = 1)

Output:

Figure: Anderson-Darling Test in R (histogram and QQ plot of the Temperature variable)

The histogram shows the distribution of temperatures in degrees Celsius.

  • The x-axis represents temperature values, and the y-axis represents the frequency of occurrence.
  • The QQ plot compares the quantiles of the temperature data against the quantiles of a theoretical normal distribution.
  • Points on the QQ plot should fall approximately along a straight line if the data follows a normal distribution.
  • The red line is the reference line with intercept mean(temperature_variable) and slope sd(temperature_variable). It is not a line of equality; it shows where the points would fall if the data were normal with the sample's mean and standard deviation.
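As an alternative to computing the intercept and slope by hand, ggplot2 provides geom_qq_line(), which by default draws the reference line through the first and third quartiles of the sample. The sketch below uses simulated data in a hypothetical data frame called demo_data, not the weather dataset.

```r
library(ggplot2)

set.seed(7)
demo_data <- data.frame(value = rnorm(300, mean = 12, sd = 9))  # simulated values

qq_plot <- ggplot(demo_data, aes(sample = value)) +
  geom_qq() +                    # sample vs. theoretical normal quantiles
  geom_qq_line(color = "red") +  # reference line through the quartiles
  labs(title = "QQ Plot with geom_qq_line()", x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal()

print(qq_plot)
```

A quartile-based line is less affected by extreme values than one built from the mean and standard deviation, so the two lines can differ noticeably for heavy-tailed data.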

Advantages

  1. Sensitivity to deviations in the tails of distributions, making it effective for detecting differences in extreme values.
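  2. With ad.test() from the nortest package, there is no need to specify the mean or standard deviation of the normal distribution being tested; both are estimated from the sample.
  3. The Anderson-Darling statistic itself is not limited to normality: it can assess goodness-of-fit for other continuous distributions, although nortest's ad.test() tests normality only, and other distributions need their own critical values or a different function.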
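For distributions other than the normal, one option is the goftest package, which also exports a function named ad.test() that takes the null CDF as an argument. The sketch below assumes goftest is installed and tests simulated data against an exponential distribution with a fully specified rate; the variable names and the rate are chosen for illustration.

```r
library(goftest)

set.seed(99)
waiting_times <- rexp(200, rate = 0.5)  # simulated data

# Null hypothesis: exponential distribution with rate 0.5 (fully specified)
# The goftest:: prefix avoids confusion with nortest::ad.test()
result <- goftest::ad.test(waiting_times, null = "pexp", rate = 0.5)
print(result)
```

If nortest and goftest are both loaded, the package loaded last masks the other's ad.test(), so calling the function with an explicit namespace prefix is the safer habit.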

Limitations

  1. The Anderson-Darling test may become overly sensitive with large sample sizes, detecting even small deviations from the specified distribution.
  2. The test is primarily designed for continuous distributions and may not be suitable for discrete distributions.
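The large-sample sensitivity mentioned in the first limitation can be explored with a small simulation. This sketch assumes nortest is installed and uses a t distribution with 20 degrees of freedom as a stand-in for data that are close to, but not exactly, normal.

```r
library(nortest)

set.seed(10)
# Simulated data: nearly normal, with slightly heavier tails
nearly_normal <- rt(100000, df = 20)

# Same data-generating process, two very different sample sizes
p_small <- ad.test(nearly_normal[1:100])$p.value
p_large <- ad.test(nearly_normal)$p.value

c(n_100 = p_small, n_100000 = p_large)
```

With the full sample the p-value tends to be far smaller than with 100 observations, even though the departure from normality is the same. For large datasets it helps to read the test result alongside the histogram and QQ plot before deciding whether the deviation matters in practice.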

Conclusion

The Anderson-Darling test is a valuable tool in statistical analysis for assessing the goodness-of-fit of a sample dataset to a specified distribution, with a particular emphasis on detecting differences in extreme values. Through this article, we’ve explored the significance of the Anderson-Darling test, its syntax in R, and demonstrated its application using real-world datasets.


