Regularized Discriminant Analysis


Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) work straightforwardly when the number of observations is far greater than the number of predictors (n > p). In these situations, LDA offers advantages such as ease of application (since we don't have to calculate a separate covariance matrix for each class) and robustness to deviations from the model assumptions.

However, LDA becomes seriously challenging when the number of observations is smaller than the number of predictors, as in microarray settings, for two reasons:

The sample covariance matrix is singular and cannot be inverted.
The high dimensionality makes direct matrix operations expensive, which hinders the applicability of the method.
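The first challenge can be checked numerically. The following is a minimal sketch on simulated data (the sizes n = 5 and p = 10 are arbitrary choices for illustration, not values from this article): with fewer observations than predictors, the sample covariance matrix has rank at most n - 1, so it cannot have full rank p.

```r
# Simulated data with fewer observations than predictors (n < p).
# The sizes below are illustrative assumptions.
set.seed(1)
n <- 5
p <- 10
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

S <- cov(X)            # p x p sample covariance matrix
rank_S <- qr(S)$rank   # numerical rank of S

cat("dimension of S:", dim(S), "\n")
cat("rank of S:", rank_S, "(full rank would be", p, ")\n")
```

Because the rank is below p, the matrix is singular and the inverse required by the LDA discriminant function does not exist.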

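Therefore, we make some changes to LDA and QDA: we form a new covariance matrix that combines the covariance matrix of LDA ($\hat{\Sigma}$) and QDA ($\hat{\Sigma}_k$) using a tuning parameter $\lambda$:

$$\hat{\Sigma}_k(\lambda) = (1 - \lambda)\hat{\Sigma}_k + \lambda\hat{\Sigma}$$

However, some versions of regularized discriminant analysis use another parameter ($\gamma$) with the following equation:

$$\hat{\Sigma}_k(\lambda, \gamma) = (1 - \gamma)\hat{\Sigma}_k(\lambda) + \gamma \frac{1}{p} \mathrm{tr}\left(\hat{\Sigma}_k(\lambda)\right) I$$

Here p is the number of predictors and I is the p x p identity matrix.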

RDA shrinks the separate covariances of QDA toward the common covariance of LDA. This improves the estimate of the covariance matrix in situations where the number of predictors is larger than the number of samples in the training data, which can improve model accuracy.
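The effect of this shrinkage can be illustrated with a small sketch on simulated data. The class count, sample sizes, and the blending weight of 0.5 are hypothetical choices, and the blend is assumed to be a simple weighted average of the class covariance and the pooled covariance. Each class has too few observations for its own covariance to be invertible, but the blend with the pooled covariance has full rank.

```r
# Hypothetical setup: K classes, n_k observations per class, p predictors,
# chosen so that each class alone has fewer observations than predictors.
set.seed(42)
p <- 6
n_k <- 4
K <- 3
X <- lapply(1:K, function(k) matrix(rnorm(n_k * p, mean = k), nrow = n_k))

# Per-class covariance matrices (what QDA would use)
S_list <- lapply(X, cov)

# Pooled within-class covariance (what LDA would use)
S_pooled <- Reduce(`+`, lapply(S_list, function(S) (n_k - 1) * S)) / (K * n_k - K)

# Assumed blend: weighted average of class and pooled covariance
lambda <- 0.5
S_blend <- (1 - lambda) * S_list[[1]] + lambda * S_pooled

ranks <- c(class   = qr(S_list[[1]])$rank,
           pooled  = qr(S_pooled)$rank,
           blended = qr(S_blend)$rank)
print(ranks)
```

The class covariance has rank at most n_k - 1, while the blended matrix reaches rank p and can therefore be inverted.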

In the above equation, \gamma and \lambda both take values between 0 and 1. Each of the four corner combinations produces a special case. Let's look at these special cases:

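(\gamma = 0, \lambda = 0): the covariance of QDA, i.e. the individual covariance of each group.
(\gamma = 0, \lambda = 1): the covariance of LDA, i.e. a common covariance matrix.
(\gamma = 1, \lambda = 0): conditionally independent variables, similar to Naive Bayes but with equal variances for all variables within a group.
(\gamma = 1, \lambda = 1): classification using Euclidean distance, similar to the previous case, but variances are the same for all groups.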
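The four special cases can be reproduced with a few lines of base R. This is a sketch, not the klaR implementation: the helper `rda_cov` is a hypothetical name, and it assumes the two-step form in which the class covariance is first averaged with the pooled covariance using weight lambda, and the result is then averaged with a scaled identity matrix (trace divided by p) using weight gamma.

```r
# Hypothetical helper: regularized covariance for one class under the
# assumed two-step form described above.
rda_cov <- function(S_k, S_pooled, lambda, gamma) {
  p <- nrow(S_k)
  S_lambda <- (1 - lambda) * S_k + lambda * S_pooled
  (1 - gamma) * S_lambda + gamma * (sum(diag(S_lambda)) / p) * diag(p)
}

# Per-class and pooled covariance matrices from the iris predictors
X <- iris[, 1:4]
S_list <- lapply(split(X, iris$Species), cov)
n_k <- as.integer(table(iris$Species))
S_pooled <- Reduce(`+`, Map(function(S, n) (n - 1) * S, S_list, n_k)) /
  (nrow(X) - length(S_list))
S_k <- S_list[["setosa"]]

# The four corner cases for the setosa class
qda_case       <- rda_cov(S_k, S_pooled, lambda = 0, gamma = 0)  # equals S_k
lda_case       <- rda_cov(S_k, S_pooled, lambda = 1, gamma = 0)  # equals S_pooled
indep_case     <- rda_cov(S_k, S_pooled, lambda = 0, gamma = 1)  # class-specific multiple of I
euclidean_case <- rda_cov(S_k, S_pooled, lambda = 1, gamma = 1)  # common multiple of I

print(round(indep_case, 4))
print(round(euclidean_case, 4))
```

In the last two cases the matrix is diagonal with a single variance on the diagonal; the difference is whether that variance is computed per group or shared across groups.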

Implementation

In this implementation, we will perform Regularized Discriminant Analysis in R using the rda function from the klaR library and the iris dataset. The data-splitting and preprocessing helpers (createDataPartition and preProcess) come from the caret library.

# imports
library(tidyverse)
library(caret)   # createDataPartition() and preProcess()
library(MASS)
library(klaR)    # rda()

data('iris')

# divide the data into train (80%) and test (20%) sets, stratified by species
train_test.samples <- iris$Species %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- iris[train_test.samples, ]
test.data <- iris[-train_test.samples, ]

# Data preprocessing
# Center and scale the numeric predictors using parameters estimated on the
# training set; the factor column Species is left unchanged
preproc.param <- train.data %>%
  preProcess(method = c("center", "scale"))

# Transform the data using the estimated parameters
train.transformed <- preproc.param %>% predict(train.data)
test.transformed <- preproc.param %>% predict(test.data)

# fit the rda model; gamma and lambda are estimated because none are supplied
model <- rda(Species ~ ., data = train.transformed)
model

# run the model on test data and generate the predictions
predictions <- model %>% predict(test.transformed)

# calculate model accuracy
mean(predictions$class == test.transformed$Species)

Output (exact values vary from run to run because the train/test split is random):

Call: 
rda(formula = Species ~ ., data = train.transformed)

Regularization parameters: 
      gamma      lambda 
0.002619109 0.222244278 

Prior probabilities of groups: 
    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 

Misclassification rate: 
       apparent: 1.667 %
cross-validated: 1.667 %

### accuracy
0.9666667
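Instead of letting rda estimate the regularization parameters, both can be fixed through its gamma and lambda arguments. The sketch below fits the two corner cases with gamma = 0 on the full iris data. The object names are illustrative, and the accuracies are computed on the training data, so they are optimistic compared with a held-out test set.

```r
library(klaR)

# Fix both regularization parameters instead of letting rda() search for them
fit_lda_like <- rda(Species ~ ., data = iris, gamma = 0, lambda = 1)  # common covariance
fit_qda_like <- rda(Species ~ ., data = iris, gamma = 0, lambda = 0)  # per-class covariance

# Training-set accuracy of each fit (optimistic estimate)
acc <- sapply(
  list(lda_like = fit_lda_like, qda_like = fit_qda_like),
  function(fit) mean(predict(fit, iris)$class == iris$Species)
)
print(acc)
```

Comparing fixed settings like these against the automatically chosen parameters is one way to see how much the regularization actually matters for a given dataset.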
