
Random Forest Approach in R Programming

Random Forest in R Programming is an ensemble of decision trees: it builds many decision trees and combines their predictions to produce more accurate results than any single tree. It is a non-linear classification algorithm. For each tree, the observations left out of its bootstrap sample are used to estimate the model's error; this is called the out-of-bag (OOB) error estimate and is reported as a percentage.

They are called random because they choose predictors randomly at the time of training. They are called a forest because the final decision aggregates the outputs of many trees. Random forests generally outperform individual decision trees: a large number of relatively uncorrelated trees (models) operating as a committee tends to outperform any of its constituent models.



Theory

Random forest takes random samples of the observations and random initial variables (columns) and builds a model from each. The random forest algorithm is as follows:

1. Draw a bootstrap sample (a random sample with replacement) from the training data.
2. Grow a decision tree on the sample, choosing from a random subset of the predictors at each split.
3. Repeat steps 1 and 2 to build k trees.
4. Aggregate the predictions of all k trees: majority vote for classification, average for regression.
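This sampling-and-voting procedure can be sketched in base R. The snippet below is only an illustration of the idea, not a real tree learner: each "tree" is replaced by a trivial stand-in that predicts the most frequent class in its bootstrap sample, and all names (`k`, `votes`, `boot`) are chosen for this example.

```r
set.seed(42)

data(iris)
n <- nrow(iris)
k <- 25                  # number of "trees" in this toy forest
votes <- character(k)

for (i in seq_len(k)) {
  # Step 1: bootstrap sample of the rows (sampling with replacement)
  boot_rows <- sample(n, n, replace = TRUE)
  # Step 2: random subset of the predictor columns
  boot_cols <- sample(1:4, 2)
  boot <- iris[boot_rows, c(boot_cols, 5)]
  # Stand-in for a fitted tree: predict the most frequent class in
  # the bootstrap sample (a real forest grows a decision tree here)
  votes[i] <- names(which.max(table(boot$Species)))
}

# Steps 3-4: aggregate the individual predictions by majority vote
majority <- names(which.max(table(votes)))
majority
```

A real random forest differs only in step 2: it fits a full decision tree to each bootstrap sample, re-drawing the random predictor subset at every split.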

Example:
Consider a fruit box containing three fruits, apples, oranges, and cherries, as training data, i.e. n = 3. We want to predict which fruit is most numerous in the box. A random forest model is built on the training data with the number of trees k = 3.



Each tree judges the fruit using various features of the data, i.e. diameter, color, shape, and groups. The three trees predict orange, cherry, and orange; by majority vote, the random forest selects orange as the most numerous fruit in the box.

The Dataset

The iris dataset is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured on each sample: the length and width of the sepals and petals. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from one another.




# Loading data
data(iris)
  
# Structure 
str(iris)

Performing Random Forest on the dataset

Applying the random forest algorithm to the iris dataset, which contains 150 observations and 5 variables (four numeric features and the species label).




# Installing packages
install.packages("caTools")       # For sampling the dataset
install.packages("randomForest")  # For implementing random forest algorithm
  
# Loading package
library(caTools)
library(randomForest)
  
# Splitting data into train and test sets
split <- sample.split(iris$Species, SplitRatio = 0.7)
split
  
train <- subset(iris, split == TRUE)
test <- subset(iris, split == FALSE)
  
# Fitting Random Forest to the train dataset
set.seed(120)  # Setting seed
classifier_RF = randomForest(x = train[-5],
                             y = train$Species,
                             ntree = 500)
  
classifier_RF
  
# Predicting the Test set results
y_pred = predict(classifier_RF, newdata = test[-5])
  
# Confusion Matrix
confusion_mtx = table(test[, 5], y_pred)
confusion_mtx
  
# Plotting model
plot(classifier_RF)
  
# Importance plot
importance(classifier_RF)
  
# Variable importance plot
varImpPlot(classifier_RF)
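The confusion matrix can be summarized into a single accuracy figure: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total count. The matrix below is an illustrative example with made-up counts, not the actual output of the code above.

```r
# Example confusion matrix: rows are actual classes, columns are
# predicted classes (values are illustrative, not real results)
confusion_mtx <- matrix(c(20,  0,  0,
                           0, 18,  2,
                           0,  1, 19),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(
                          Actual    = c("setosa", "versicolor", "virginica"),
                          Predicted = c("setosa", "versicolor", "virginica")))

# Accuracy = correct predictions (diagonal) / all predictions
accuracy <- sum(diag(confusion_mtx)) / sum(confusion_mtx)
accuracy   # 0.95 for this example matrix
```

The same two lines work on the `confusion_mtx` produced by `table()` in the code above, since `table()` also returns a matrix-like contingency table.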


Random forest is thus a powerful and widely used algorithm for classification tasks in industry.
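One common refinement is tuning `mtry`, the number of predictors tried at each split, using `tuneRF()` from the same randomForest package. The parameter values below (`stepFactor`, `improve`, `ntreeTry`) are illustrative starting points, not recommendations.

```r
library(randomForest)

data(iris)
set.seed(120)

# Search for the mtry value with the lowest out-of-bag error
tuned <- tuneRF(x = iris[-5],
                y = iris$Species,
                stepFactor = 1.5,   # multiply/divide mtry by this each step
                improve = 0.01,     # minimum relative OOB improvement to continue
                ntreeTry = 500,     # trees grown per candidate mtry
                trace = TRUE,
                plot = FALSE)

tuned   # matrix of mtry values and their OOB error estimates
```

The best `mtry` found can then be passed to `randomForest()` via its `mtry` argument when fitting the final model.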

