
Ensemble Learning with SVM and Decision Trees

Last Updated : 10 Mar, 2024

Ensemble learning is a machine learning technique that combines multiple individual models to improve predictive performance. Two popular algorithms used in ensemble learning are Support Vector Machines (SVMs) and Decision Trees.

What is Ensemble Learning?

Ensemble learning is a machine learning approach that merges many models (also referred to as “base learners” or “weak learners”) into a single, stronger model referred to as an “ensemble model.” It is based on the premise that, by aggregating the predictions of numerous models, the ensemble can frequently outperform any individual model it contains.

What are Decision Trees?

A decision tree is a tree-like structure where:

  1. Each internal node represents a “test” on an attribute (e.g., whether a feature is greater than a certain threshold).
  2. Each branch represents the outcome of the test.
  3. Each leaf node represents a class label (in classification) or a continuous value (in regression).
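
As a minimal sketch of this structure, the snippet below trains a shallow scikit-learn DecisionTreeClassifier on the Iris dataset (chosen here purely for illustration) and prints the learned rules, where each line is either an attribute test at an internal node or a leaf's class label:

Python3

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Keep the tree shallow so the printed rules stay readable
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Internal nodes appear as feature tests, leaves as class labels
print(export_text(tree, feature_names=iris.feature_names))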

What are Support Vector Machines?

Support Vector Machines (SVMs) are supervised learning models used for classification and regression tasks. In classification, SVMs find the hyperplane that best separates different classes in the feature space. This hyperplane is chosen to maximize the margin, which is the distance between the hyperplane and the nearest data points from each class, known as support vectors.
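
As a minimal sketch of these ideas, the snippet below fits a linear-kernel SVC from scikit-learn on the breast cancer dataset (the same dataset used in the implementation later) and reports how many support vectors define the margin:

Python3

from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# A linear kernel learns an explicit separating hyperplane
svm = SVC(kernel='linear')
svm.fit(X, y)

# The margin is determined by these boundary points (support vectors)
print(f'Support vectors per class: {svm.n_support_}')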

How to combine Support Vector Machines (SVM) and Decision Trees?

Here are some common approaches to combining Support Vector Machines (SVMs) and Decision Trees:

  1. Bagging (Bootstrap Aggregating): This involves training multiple SVMs or Decision Trees on different bootstrap samples of the training data and then combining their predictions. This can reduce overfitting and improve generalization (see the sketch after this list).
  2. Boosting: Algorithms like AdaBoost can be used to combine multiple SVMs or Decision Trees sequentially, with each subsequent model focusing on the mistakes of the previous ones. This can improve the overall performance of the combined model.
  3. Random Forests: This ensemble method combines multiple Decision Trees, each trained on a bootstrap sample of the data and considering a random subset of the features at each split. It can be effective for both classification and regression tasks.
  4. Cascade SVM: This approach uses a Decision Tree to pre-select samples that are then fed into separate SVM classifiers. This can be useful when the dataset is large and SVM training is computationally expensive.
  5. SVM as a feature selector for Decision Trees: Use the SVM to select the most relevant features from the dataset, and then train a Decision Tree on the selected features. This can improve the interpretability of the Decision Tree and reduce the impact of irrelevant features.
  6. Stacking: Train multiple SVMs and Decision Trees separately on the dataset and then use another model (e.g., a logistic regression or another Decision Tree) to combine their predictions. This can often lead to better performance than any individual model (see the sketch after this list).
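
As a concrete sketch of approaches 1 and 6, the snippet below bags SVMs with scikit-learn's BaggingClassifier and stacks an SVM with a Decision Tree under a logistic regression meta-model using StackingClassifier. The hyperparameters are illustrative defaults rather than tuned values, and the estimator parameter name assumes scikit-learn 1.2 or later:

Python3

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Bagging: each SVM sees a different bootstrap sample of the training data
bagged_svm = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=42)
bagged_svm.fit(X_train, y_train)
print(f'Bagged SVM accuracy: {bagged_svm.score(X_test, y_test):.3f}')

# Stacking: SVM and Decision Tree predictions feed a logistic regression meta-model
stacked = StackingClassifier(
    estimators=[('svm', SVC()), ('dt', DecisionTreeClassifier(random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000))
stacked.fit(X_train, y_train)
print(f'Stacked ensemble accuracy: {stacked.score(X_test, y_test):.3f}')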

Implementation of an SVM and Decision Tree Ensemble

In this implementation, we use a Voting Classifier with a Support Vector Machine (SVM) and a Decision Tree (DT) as base estimators on the breast cancer dataset.

Importing Necessary Libraries

Python3
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


Loading and splitting the dataset

Python3
# Load the breast cancer dataset
breast_cancer = load_breast_cancer()
X_bc, y_bc = breast_cancer.data, breast_cancer.target
 
# Split the dataset into training and test sets
X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(X_bc, y_bc, test_size=0.2, random_state=42)


Creating Base Estimators

  • SVC (Support Vector Classifier): The probability=True parameter allows the model to predict probabilities for each class, which is necessary for soft voting in the VotingClassifier.
  • DecisionTreeClassifier: This classifier creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Each internal node represents a “test” on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. In the context of the VotingClassifier, the decision tree serves as another base estimator for voting.

Python3
# Create the base estimators
svm_bc = SVC(probability=True)
dt_bc = DecisionTreeClassifier()


Ensemble Learning

  • VotingClassifier creation: The VotingClassifier is created with estimators=[('svm', svm_bc), ('dt', dt_bc)], specifying the list of base estimators to be used for the voting. The voting='soft' parameter indicates that the classifier will use soft voting, which means it predicts the class label based on the argmax of the sums of the predicted probabilities.
  • Training the voting classifier: The fit method is called on the voting_clf_bc object with the training data X_train_bc and y_train_bc to train the classifier on the breast cancer dataset.

Python3
# Create the voting classifier
voting_clf_bc = VotingClassifier(estimators=[('svm', svm_bc), ('dt', dt_bc)], voting='soft')
 
# Train the voting classifier
voting_clf_bc.fit(X_train_bc, y_train_bc)


Evaluation of the Model

  • Making predictions: The predict method is called on the voting_clf_bc object with the test data X_test_bc to make predictions for the breast cancer dataset.
  • Evaluating accuracy: The accuracy_score function is used to compare the predicted labels y_pred_bc with the actual labels y_test_bc from the test set. The accuracy is then printed to the console using f-string formatting.

Python3
# Make predictions
y_pred_bc = voting_clf_bc.predict(X_test_bc)
 
# Evaluate the accuracy
accuracy_bc = accuracy_score(y_test_bc, y_pred_bc)
print(f'Accuracy on breast cancer dataset: {accuracy_bc}')


Output:

Accuracy on breast cancer dataset: 0.9385964912280702
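
For context, it can be worth comparing the ensemble against each base estimator trained on its own; below is a minimal sketch reusing the variables defined above (the Decision Tree's score will vary from run to run unless a random_state is fixed):

Python3

# Train and score each base estimator alone for comparison
for name, model in [('SVM', SVC(probability=True)),
                    ('Decision Tree', DecisionTreeClassifier())]:
    model.fit(X_train_bc, y_train_bc)
    print(f'{name} accuracy: {model.score(X_test_bc, y_test_bc):.4f}')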


