
Voting in Machine Learning

Last Updated : 08 Jan, 2024

What is Sklearn?

Scikit-learn, also known as Sklearn, is a machine-learning package for Python. The name is derived from “SciPy Toolkit” (SciKit). Sklearn is built on NumPy, SciPy, and Matplotlib, which has two major consequences:

  • Sklearn is fast and efficient.
  • It works primarily with NumPy arrays.

Advantages of using sklearn

  1. Incredible documentation
  2. Variety – In terms of ML, Sklearn is the leading package for variety, covering regression, classification, clustering, support vector machines, and dimensionality reduction.
  3. Numerical stability – Sklearn is famously numerically stable. Training an algorithm involves performing complicated mathematical operations in the background, and when the numbers involved are too small or too large, code can break; numerical stability guards against exactly this.

Voting in scikit-learn (Sklearn) allows us to combine multiple machine-learning models and use a majority vote or a weighted vote to make predictions. It is a way to ensemble different models for potentially better performance.

Voting in Sklearn is an ensemble method that combines multiple individual classifiers or regressors to make predictions; for classification, it supports both hard and soft voting strategies. Now let’s understand what ensemble learning is.

Ensemble learning is a powerful technique in machine learning where multiple models are combined to improve overall predictive performance and robustness. The fundamental idea is that by aggregating the predictions of diverse models, the weaknesses of individual models can be mitigated, leading to more accurate and reliable results.

One popular form of ensemble learning is the Voting Classifier. This approach involves training multiple base models independently and then combining their predictions through a voting mechanism. The aggregated decision, whether by majority vote or weighted voting, often yields better generalization and predictive performance than any individual model.

The scikit-learn library provides a convenient implementation of the Voting Classifier, allowing us to easily integrate and experiment with different models in a unified framework. This approach is particularly useful when dealing with a variety of data patterns and ensures a more robust prediction, making it a valuable tool in the machine learning practitioner’s toolkit.

Concepts related to Sklearn’s Voting

Let’s understand the concepts that provide a foundation for effectively applying ensemble-learning techniques, like the Voting Classifier or the Voting Regressor, in various machine-learning scenarios.

Voting Strategies:

  • Hard Voting – Each base classifier casts one vote for its predicted class, and the class that receives the majority of votes is selected as the final prediction. It is commonly used in classification problems. (For regression there is no hard/soft distinction; the Voting Regressor simply averages the individual predictions.)
  • Soft Voting – A weighted average of the predicted probabilities is used to make the final prediction. It is suitable when the classifiers provide probability estimates. In other words, for each class, it sums the predicted probabilities and predicts the class with the highest sum (the sketch below contrasts the two strategies).
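
To make the difference concrete, here is a minimal sketch using made-up probability outputs from three classifiers over two classes (the numbers are illustrative, not from any trained model):

Python

import numpy as np

# Made-up predicted probabilities from three classifiers for one sample;
# columns = [class 0, class 1] (illustrative numbers only).
probs = np.array([
    [0.6, 0.4],   # classifier 1
    [0.4, 0.6],   # classifier 2
    [0.3, 0.7],   # classifier 3
])

# Hard voting: each classifier votes for its most likely class.
votes = probs.argmax(axis=1)             # [0, 1, 1]
hard_pred = np.bincount(votes).argmax()  # class 1 wins, 2 votes to 1

# Soft voting: sum the probabilities per class, pick the largest sum.
soft_pred = probs.sum(axis=0).argmax()   # sums [1.3, 1.7] -> class 1

print(hard_pred, soft_pred)  # 1 1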

Base Models: The individual models that form the ensemble – for example, Support Vector Machines, Logistic Regression, and Decision Trees.

Classifier and Regressor Variants:

  • Voting Classifier – Combines multiple classifiers for classification tasks.
  • Voting Regressor – Combines multiple regressors for regression tasks.

Model Diversity: The extent to which individual models in an ensemble are different from each other. Diversity is crucial for improving overall ensemble performance.

Bagging and Boosting:

  • Bagging (Bootstrap Aggregating): Constructs multiple models in parallel with different subsets of the training dataset.
  • Boosting: Constructs models sequentially, giving more weight to instances misclassified by the previous models.

Random Forest: An ensemble learning method that constructs a multitude of decision trees and merges them to get a more accurate and stable prediction.
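
As a rough illustration of these three ensemble styles, here is a minimal sketch on the Iris data (the estimator counts and random_state are arbitrary choices, not tuned values):

Python

from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: 50 trees fit in parallel on bootstrap samples of the data.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=42)

# Boosting: 50 trees fit sequentially, upweighting misclassified samples.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

# Random forest: bagged trees with random feature subsets at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=42)

for name, model in [('bagging', bagging), ('boosting', boosting),
                    ('random forest', forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())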

Cross-Validation: A technique to assess how well a model will generalize to an independent dataset by partitioning the training data into subsets.
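
For instance, a 5-fold cross-validation on the Iris data might look like this (a minimal sketch; the model and fold count are arbitrary):

Python

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: the data is split into 5 subsets; each subset is held out
# once for evaluation while the model trains on the remaining 4.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())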

Hyperparameter Tuning: The process of finding the best set of hyperparameters for a model to optimize its performance. It is crucial for enhancing the effectiveness of individual base models within the ensemble.
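
Scikit-learn ensembles expose their base models’ hyperparameters through the ‘<estimator name>__<parameter>’ syntax, so a grid search can tune them inside the ensemble. A minimal sketch (the grid values are illustrative, not recommendations):

Python

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('dt', DecisionTreeClassifier())],
    voting='hard')

# 'lr__C' tunes C inside the logistic regression named 'lr';
# 'dt__max_depth' tunes the tree named 'dt'.
param_grid = {'lr__C': [0.1, 1.0, 10.0],
              'dt__max_depth': [2, 4, None]}
search = GridSearchCV(clf, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)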

Voting Classifier

It is a class in scikit-learn that implements the ensemble voting strategy. It takes a list of base models and combines their predictions based on the specified voting strategy.

Steps:

To implement a ‘Voting Classifier’ using scikit-learn (sklearn), we can follow these general steps, which provide a basic outline:

  1. Import Necessary Libraries
  2. Load the Dataset
  3. Choose Base Models and Instantiate Base Models
  4. Create the ‘VotingClassifier’, setting the ‘voting’ parameter to ‘hard’ or ‘soft’ as appropriate
  5. Fit the Model
  6. Make Predictions
  7. Evaluate Performance, choosing an evaluation metric suited to your problem (e.g., accuracy, precision, recall, F1-score)
  8. Hyperparameter Tuning (optional): tune the hyperparameters of individual models or of the ensemble itself for better performance
  9. Experiment and Iterate (optional): adjust the base models, voting strategy, and other parameters based on the characteristics of your data and the performance you observe

Pre-requisites:

Make sure that scikit-learn is installed in your Python environment. You can install scikit-learn using the following command in your terminal or command prompt:

pip install scikit-learn

Code Implementation

Now we will apply these steps and implement Sklearn’s Voting through an example.

Refer to the Python code below, a simple example using the popular Iris dataset. In this example, we use a combination of Logistic Regression, Decision Tree, and Support Vector Machine models. The ‘VotingClassifier’ combines their predictions using soft voting.

Python

# Step 1 - Importing libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
 
# Step 2 - Loading Iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)
 
# Step 3 - Define base models
model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model3 = SVC(probability=True)
 
# Step 4 - Creating a VotingClassifier with soft voting
voting_classifier = VotingClassifier(
    estimators=[('lr', model1), ('dt', model2), ('svc', model3)], voting='soft')
 
# Step 5 - Fit the model
voting_classifier.fit(X_train, y_train)
 
# Step 6 - Make predictions
y_prediction = voting_classifier.predict(X_test)
 
# Step 7 - Evaluating accuracy
accuracy = accuracy_score(y_test, y_prediction)
print('Accuracy:', accuracy)


Output:

Accuracy: 0.9666666666666667

This means the model’s accuracy is 96.67%. The code uses the Iris dataset to train a Voting Classifier with Logistic Regression, Decision Tree Classifier, and Support Vector Machine as base models.
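
Because the ensemble above uses soft voting, it also exposes the averaged class probabilities of its base models. As a follow-up usage sketch (this assumes the code above has already been run, so voting_classifier and X_test exist):

Python

# With voting='soft', predict_proba returns the base models' class
# probabilities averaged together, one row per sample.
probabilities = voting_classifier.predict_proba(X_test[:3])
print(probabilities)
print(probabilities.argmax(axis=1))  # the same labels predict() returns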

Voting Regressor

The Voting Regressor in scikit-learn is an ensemble method used for regression tasks. It combines the predictions from multiple base regression models to produce a more robust and accurate final prediction. Unlike the ‘VotingClassifier’, the ‘VotingRegressor’ has no hard/soft distinction: it averages the base models’ predictions, and the average can optionally be weighted (via the ‘weights’ parameter) to give some models more influence than others.

Steps with an example:

These steps provide a basic outline for implementing a ‘Voting Regressor’. Let’s break this down into code snippets with an example:

Note – Please make sure that scikit-learn is installed in your Python environment before importing the required libraries; if not, use the command below in your terminal or command prompt. Also note that ‘load_boston’ was deprecated in scikit-learn 1.0 and removed in 1.2, so this example as written requires scikit-learn < 1.2 (on newer versions, substitute another regression dataset such as ‘fetch_california_housing’).

pip install scikit-learn

1. Import necessary libraries:

In this step, we import the required modules: the dataset loader (‘load_boston’), the data splitter (‘train_test_split’), the ensemble class (‘VotingRegressor’), several base regression models (‘LinearRegression’, ‘DecisionTreeRegressor’, ‘RandomForestRegressor’), and a performance metric for evaluation (‘mean_squared_error’).

Python

# Import libraries
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


2. Load the Dataset

Here, we load the Boston Housing dataset using ‘load_boston()’ and split it into training and testing sets, with 80% for training and 20% for testing. The ‘random_state’ parameter is used in scikit-learn functions that involve randomness to ensure reproducibility: setting it to a specific number, such as 42, makes the randomness predictable, so the same train/test split is obtained every time the code runs. The number 42 itself is arbitrary; any integer works. Reproducible splits are important for debugging and testing, as the quick check after the code below demonstrates.

Python

# Load Boston Housing dataset
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=42)
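
To see what reproducibility means in practice, here is a quick standalone check; it uses the Iris data so it also runs on scikit-learn versions where load_boston is unavailable:

Python

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two splits with the same random_state yield identical partitions.
X_a, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_b, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_a, X_b))  # True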


3. Choose base regressor models

Now we will define three base regression models: Linear regression, decision tree regressor, and random forest regressor.

Python

# Choose and define base regressor models
model1 = LinearRegression()
model2 = DecisionTreeRegressor()
model3 = RandomForestRegressor()


4. Create the Voting Regressor

Here, we create a VotingRegressor instance, specifying the base models and their corresponding names. The estimators parameter takes a list of (name, model) tuples.

Python

# Create a VotingRegressor
voting_regressor = VotingRegressor(
    estimators=[('lr', model1), ('dt', model2), ('rf', model3)])
# lr - for Linear Regression, dt - for Decision Tree regressor, rf - for Random Forest regressor


5. Fit the model

The next step is to fit the VotingRegressor on the training data – X_train and y_train.

Python

# Fit the model
voting_regressor.fit(X_train, y_train)


6. Make predictions

We will make predictions on the test set (X_test) using the trained VotingRegressor.

Python

# Make predictions on the test set
y_prediction = voting_regressor.predict(X_test)


7. Evaluate performance

Finally, we will evaluate the performance of the VotingRegressor using the Mean Squared Error metric and print the result.

Python

# Evaluate performance using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_prediction)
print(f'Mean Squared Error: {mse:.2f}')


Output:

Mean Squared Error: 14.74

This example uses a ‘VotingRegressor’ with Linear Regression, Decision Tree Regressor, and Random Forest Regressor as base models. The final prediction is the average of the predictions from these base models (weighted, if the ‘weights’ parameter is supplied).

This approach allows us to leverage the strengths of different regression models and potentially achieve better predictive performance than any individual model alone.

This example code loads the Boston Housing dataset and splits it into training and testing sets. It defines three base regressors (LinearRegression, DecisionTreeRegressor, and RandomForestRegressor), combines them into a VotingRegressor, fits the ensemble on the training data, and makes predictions on the test set. Finally, the model’s performance is evaluated with the Mean Squared Error metric, which comes out to 14.74 when the code is run.
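
If some base models deserve more influence than others, the ‘weights’ parameter skews the average accordingly. A minimal sketch on a made-up synthetic regression problem (the weights and data are purely illustrative):

Python

import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression problem, made up for illustration.
rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 1.0]) + 0.1 * rng.randn(100)

# weights=[2, 1]: the linear model's predictions count twice as much
# as the tree's in the averaged output.
weighted = VotingRegressor(
    estimators=[('lr', LinearRegression()), ('dt', DecisionTreeRegressor())],
    weights=[2, 1])
weighted.fit(X, y)
print(weighted.predict(X[:3]))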


