Boosting in R

Last Updated : 11 Jul, 2023

Boosting is a machine learning technique that improves the performance of predictive models by combining many weak models into a single strong ensemble. R is a popular language for machine learning and provides several packages that implement boosting algorithms, making it straightforward to apply the technique to real problems.

In this article, we will look at the idea of boosting and how to use it in R. Unlike bagging and random forests, which train their base models independently, boosting trains weak models sequentially: each new model concentrates on the examples the previous models predicted poorly, and the final prediction is a weighted combination of all the weak models.

  1. Importing the Data: Importing data is the first step. In R Programming Language, you can read data from a CSV file with the read.csv() function, or use other reader functions for other file formats.
  2. Splitting the Data: Next, divide the data into training and testing sets so you can detect overfitting and estimate how well the model generalizes to unseen data.
  3. Building the Model: The next step is to build the boosting model, for example with the gbm() function. Its key arguments are the model formula, the data, and the number of trees to grow; other parameters, such as the maximum depth of each tree and the learning rate, can also be specified.
  4. Tuning the Model: After building the model, tune it to improve its performance. This involves adjusting hyperparameters such as the learning rate, tree depth, and number of trees, and checking accuracy on validation data (see the tuning sketch after this list).
  5. Evaluating the Model: Finally, evaluate the model with appropriate metrics: accuracy, precision, recall, and F1 score for classification, or mean squared error for regression.
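
Step 4 is often done with cross-validation. Below is a minimal sketch of one way to do it with the gbm package (the parameter values here are illustrative, not recommendations): gbm() can fit with built-in cross-validation via its cv.folds argument, and gbm.perf() then estimates the number of trees that minimizes the cross-validated error.

R

# A minimal tuning sketch: fit with 5-fold cross-validation,
# then let gbm.perf() estimate the best number of trees
library(gbm)
data(mtcars)

set.seed(123)
cv_model <- gbm(mpg ~ ., data = mtcars,
                distribution = "gaussian",
                n.trees = 2000, shrinkage = 0.01,
                interaction.depth = 4,
                bag.fraction = 0.7,
                n.minobsinnode = 5,
                cv.folds = 5)

# Number of trees with the lowest cross-validated error
best_trees <- gbm.perf(cv_model, method = "cv", plot.it = FALSE)
best_trees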

Gradient Boosting Machine in R

First, we load the mtcars dataset and split it into training and testing sets. We then fit a boosting model to the training data with the gbm() function from the “gbm” package, using a Gaussian (squared-error) loss, 1000 trees, and a learning rate of 0.01. We use the model to predict the mpg of the test data and, finally, evaluate its performance with the mean squared error:

R
# Load the mtcars dataset
data(mtcars)
 
# Split the dataset into training and testing sets
library(caTools)
set.seed(123)
split <- sample.split(mtcars$mpg, SplitRatio = 0.7)
train <- mtcars[split, ]
test <- mtcars[!split, ]
 
# Fit a boosting model to the training data
library(gbm)
boost <- gbm(mpg ~ ., data = train,
             distribution = "gaussian",
             n.trees = 1000, shrinkage = 0.01,
             interaction.depth = 4,
             bag.fraction = 0.7,
             n.minobsinnode = 5)
 
# Use the model to predict the mpg of the test data
# (n.trees is unspecified, so predict.gbm() chooses the
# tree count itself: the "Using 1000 trees..." message)
predictions <- predict(boost, newdata = test)
 
# Evaluate the performance of the model
# using the mean squared error
mse <- mean((test$mpg - predictions)^2)
mse


Output:

Using 1000 trees...
10.8390088276318
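
A quick way to check whether 1000 trees was a reasonable choice: predict.gbm() accepts a vector of values for n.trees and returns one column of predictions per value, so the test MSE can be traced across tree counts. A short sketch reusing the boost and test objects from above (the grid of tree counts is arbitrary):

R

# One column of predictions per tree count in the grid
tree_grid <- seq(100, 1000, by = 100)
preds <- predict(boost, newdata = test, n.trees = tree_grid)

# Test MSE at each tree count, and the best count
mse_by_trees <- colMeans((test$mpg - preds)^2)
tree_grid[which.min(mse_by_trees)]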

AdaBoost Model in R

Here’s an example of how to use AdaBoost in R, via the boosting() function from the “adabag” package, to classify the iris dataset:

R
library(adabag)
 
# Load the iris dataset
data(iris)
 
# Make sure the Species column is a factor
# (it already is in the built-in iris data)
iris$Species <- as.factor(iris$Species)
 
# Split the data into training and testing sets
# (call set.seed() beforehand for a reproducible split)
index <- sample(nrow(iris), nrow(iris) * 0.7)
train <- iris[index, ]
test <- iris[-index, ]
 
# Fit the AdaBoost model using decision
# trees as base learners
model <- boosting(Species ~ ., data = train,
                  boos = TRUE, mfinal = 10,
                  control = rpart.control(cp = 0.01,
                                          minsplit = 3))
 
# Make predictions on the test set
predictions <- predict(model, newdata = test)
 
# Calculate the confusion matrix
confusion_matrix <- table(predictions$class, test$Species)
 
# Calculate the accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
 
# Print the confusion matrix and accuracy
print(confusion_matrix)
print(paste0("Accuracy: ", accuracy))


Output:

             setosa versicolor virginica
  setosa         21          0         0
  versicolor      0          8         1
  virginica       0          1        14
[1] "Accuracy: 0.955555555555556"
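
The fitted object from boosting() also carries useful diagnostics. A brief sketch, reusing the model and test objects from above: the $importance component gives the relative importance of each predictor, and errorevol() traces the test error as boosting iterations are added.

R

# Relative importance of each predictor
model$importance

# Test error after each of the mfinal = 10 iterations
evol <- errorevol(model, newdata = test)
evol$error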

XGBoost Model in R

Here’s an example of how to use XGBoost in R on the agaricus (mushroom) dataset that ships with the “xgboost” package. It is a binary classification problem, and we evaluate the trained model with the AUC from the “pROC” package.

R
# Load the required libraries
library(xgboost)
library(pROC)
 
# Load the data
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
 
# Convert the data into DMatrix format
dtrain <- xgb.DMatrix(agaricus.train$data,
                      label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data,
                     label = agaricus.test$label)
 
# Set the parameters
params <- list(max_depth = 2,
               objective = "binary:logistic",
               eval_metric = "error")
 
# Train the model
xgb_model <- xgb.train(params = params,
                       data = dtrain, nrounds = 25,
                       watchlist = list(train = dtrain,
                                        test = dtest),
                       verbose = 0)
 
# Predict on test data
pred <- predict(xgb_model, dtest)
auc(agaricus.test$label, pred)


Output:

Area under the curve: 1
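
After training, it is often worth checking which features the model relies on. A minimal sketch using the xgb_model object from above: xgb.importance() reports each feature's Gain, Cover, and Frequency, and xgb.plot.importance() charts the result.

R

# Per-feature importance scores (Gain, Cover, Frequency)
importance <- xgb.importance(model = xgb_model)
head(importance)

# Bar chart of the ten most important features
xgb.plot.importance(importance, top_n = 10)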

These are some of the most widely used boosting approaches in R: gradient boosting with gbm, AdaBoost with adabag, and extreme gradient boosting with xgboost.


