The Kruskal–Wallis test is a rank-based test that is similar to the Mann–Whitney U test but can be applied to one-way data with more than two groups. It is a non-parametric alternative to one-way ANOVA and extends the two-sample Wilcoxon test to more than two groups. The test applies to independent samples, that is, samples that come from unrelated populations and do not affect each other. With the Kruskal–Wallis test, one can decide whether the population distributions are identical without assuming that they follow a normal distribution. The test is easy to perform in R.
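To make "rank-based" concrete, the following sketch computes the Kruskal–Wallis statistic H by hand: pool all observations, rank them, sum the ranks within each group, and combine the rank sums. The data are invented for illustration and the variable names are arbitrary; only base R functions are used.

```r
# Hypothetical data: three small groups (values invented for illustration)
values <- c(12, 15, 11, 19, 14,   # group a
            22, 17, 25, 15, 20,   # group b
            9, 13, 10, 16, 8)     # group c
groups <- factor(rep(c("a", "b", "c"), each = 5))

n_total  <- length(values)
r        <- rank(values)                 # pooled ranks; ties get average ranks
rank_sum <- tapply(r, groups, sum)       # R_i: sum of ranks in each group
n_group  <- tapply(r, groups, length)    # n_i: size of each group

# Uncorrected statistic: H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
h_raw <- 12 / (n_total * (n_total + 1)) * sum(rank_sum^2 / n_group) -
  3 * (n_total + 1)

# Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N),
# where t is the number of observations sharing each distinct value
tie_sizes <- table(values)
h <- h_raw / (1 - sum(tie_sizes^3 - tie_sizes) / (n_total^3 - n_total))

# Under the null hypothesis, H is approximately chi-squared with k - 1 df
df      <- nlevels(groups) - 1
p_value <- pchisq(h, df = df, lower.tail = FALSE)

# The hand computation should agree with the built-in function
builtin <- kruskal.test(values, groups)
print(c(manual = h, builtin = unname(builtin$statistic)))
```

Because only ranks enter the formula, any monotone transformation of the values (for example, taking logarithms of positive data) leaves H unchanged.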
Note: The Kruskal–Wallis test tells whether there are differences among the groups, but it does not tell which groups differ from which.
Examples
- Suppose one wants to find out how socioeconomic status influences attitudes towards sales tax hikes. The independent variable is “socioeconomic status” with three levels: working-class, middle-class, and wealthy. The dependent variable is measured on a 5-point Likert scale from strongly agree to strongly disagree.
- Suppose one wants to find out how test anxiety influences actual test scores. The independent variable “test anxiety” has three levels: no anxiety, low-medium anxiety, and high anxiety. The dependent variable is the exam score, which ranges from 0 to 100%.
Assumptions for the Kruskal-Wallis Test
The data should satisfy the following:
- There is one independent variable with two or more levels. The test is more commonly used when there are three or more levels; for two levels, consider the Mann–Whitney U test instead.
- The dependent variable is measured on an ordinal, interval, or ratio scale.
- The observations are independent: there is no relationship between observations within a group or between groups.
- All groups have distributions of the same shape. This matters when the result is to be read as a difference in medians.
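One informal way to look at the last assumption is to compare quartiles and spread across groups. The sketch below does this for R's built-in PlantGrowth data set using base R only; the helper name describe_group is made up for this example, and judging whether shapes are "similar" from such a summary is an eyeball check, not a formal test.

```r
# Hypothetical helper: size, quartiles, and interquartile range of one group
describe_group <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(n = length(v), q1 = q[1], median = q[2], q3 = q[3], iqr = q[3] - q[1])
}

# Split the response by group and summarize each piece; one row per group
by_group      <- split(PlantGrowth$weight, PlantGrowth$group)
shape_summary <- t(sapply(by_group, describe_group))
print(shape_summary)
```

A boxplot, for example boxplot(weight ~ group, data = PlantGrowth), shows the same comparison visually.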
Implementation in R
R provides the function kruskal.test(), available in the stats package, to perform a Kruskal-Wallis rank-sum test. It can be called with data vectors or with a formula:
kruskal.test(x, g, ...)
kruskal.test(formula, data, subset, na.action, ...)
x: a numeric vector of data values, or a list of numeric data vectors.
g: a vector or factor object giving the group for the corresponding elements of x. It is ignored if x is a list.
formula: a formula of the form response ~ group, where response gives the data values and group is a vector or factor of the corresponding groups.
data: an optional matrix or data frame containing the variables in the formula.
subset: an optional vector specifying a subset of observations to be used.
na.action: a function which indicates what should happen when the data contain NAs.
...: further arguments to be passed to or from methods.
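The arguments above correspond to different calling styles: a data vector with a grouping factor, a list of vectors, or a formula with a data frame. The sketch below suggests that the three give the same result. The exam scores are invented numbers in the spirit of the test-anxiety example, and all variable names are hypothetical.

```r
# Hypothetical exam scores (0-100) for three anxiety levels; values are invented
scores  <- c(82, 75, 91, 68, 88,
             74, 69, 80, 71, 77,
             60, 72, 55, 66, 58)
anxiety <- factor(rep(c("none", "low_medium", "high"), each = 5),
                  levels = c("none", "low_medium", "high"))
exam <- data.frame(score = scores, anxiety = anxiety)

# Style 1: data vector x plus grouping factor g
res_xg <- kruskal.test(x = scores, g = anxiety)

# Style 2: x as a list of numeric vectors, one per group (g is not needed)
res_list <- kruskal.test(split(scores, anxiety))

# Style 3: formula response ~ group, with variables looked up in data
res_formula <- kruskal.test(score ~ anxiety, data = exam)

# subset restricts the formula interface to selected rows
res_subset <- kruskal.test(score ~ anxiety, data = exam,
                           subset = anxiety != "none")

print(res_formula)
```

The formula style is usually the most convenient when the data already live in a data frame with one column for the response and one for the group.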
Let’s use the built-in R data set named PlantGrowth. It contains the weight of plants obtained under control and two different treatment conditions.
   weight group
1    4.17  ctrl
2    5.58  ctrl
3    5.18  ctrl
4    6.11  ctrl
5    4.50  ctrl
6    4.61  ctrl
7    5.17  ctrl
8    4.53  ctrl
9    5.33  ctrl
10   5.14  ctrl
11   4.81  trt1
12   4.17  trt1
13   4.41  trt1
14   3.59  trt1
15   5.87  trt1
16   3.83  trt1
17   6.03  trt1
18   4.89  trt1
19   4.32  trt1
20   4.69  trt1
21   6.31  trt2
22   5.12  trt2
23   5.54  trt2
24   5.50  trt2
25   5.37  trt2
26   5.29  trt2
27   4.92  trt2
28   6.15  trt2
29   5.80  trt2
30   5.26  trt2

Levels of group: "ctrl" "trt1" "trt2"
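The listing above can be inspected directly in an R session. A minimal sketch, assuming only base R (PlantGrowth ships with the datasets package); the variable name group_medians is arbitrary:

```r
# Load the built-in data set; no download is needed
data("PlantGrowth")

str(PlantGrowth)            # structure: a numeric weight and a factor group
levels(PlantGrowth$group)   # factor levels in their stored order
table(PlantGrowth$group)    # number of plants per condition

# Median weight per group, a natural summary for a rank-based test
group_medians <- tapply(PlantGrowth$weight, PlantGrowth$group, median)
print(group_medians)
```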
Here the column “group” is called a factor, and its categories (“ctrl”, “trt1”, “trt2”) are called factor levels. The levels are ordered alphabetically. The question is whether plant weight differs significantly among the 3 experimental conditions. The test can be performed with kruskal.test(weight ~ group, data = PlantGrowth), which gives the following result.
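A minimal sketch of the call, assuming the formula interface and the built-in data set; the variable name result is arbitrary.

```r
# Formula interface: weight is the response, group is the grouping factor
result <- kruskal.test(weight ~ group, data = PlantGrowth)
print(result)

# Individual pieces of the result (an object of class "htest")
result$statistic   # Kruskal-Wallis chi-squared
result$parameter   # degrees of freedom
result$p.value     # p-value
```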
	Kruskal-Wallis rank sum test

data:  weight by group
Kruskal-Wallis chi-squared = 7.9882, df = 2, p-value = 0.01842
As the p-value (0.01842) is less than the significance level 0.05, it can be concluded that there are significant differences among the three groups.
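Since the Kruskal–Wallis test does not say which groups differ, a follow-up comparison is often run. One common choice, though not the only one (Dunn's test, available through add-on packages, is another), is pairwise Wilcoxon rank-sum tests with a multiple-comparison adjustment. The sketch below uses pairwise.wilcox.test() from the stats package; the Benjamini-Hochberg adjustment and the variable name posthoc are choices made for this example.

```r
# Pairwise Wilcoxon rank-sum tests between all pairs of groups,
# with Benjamini-Hochberg adjustment for multiple comparisons.
# exact = FALSE requests the normal approximation, since the data contain ties.
posthoc <- pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                                p.adjust.method = "BH", exact = FALSE)
print(posthoc)

# Matrix of adjusted p-values: rows are trt1 and trt2, columns are ctrl and trt1;
# the cell that would compare trt1 with itself is NA
posthoc$p.value
```

Pairs whose adjusted p-value falls below the chosen significance level are the ones the data suggest differ from each other.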