
Demonstrating the different Strategies of KBinsDiscretizer in Scikit Learn

Last Updated : 28 Nov, 2023

Whenever we look at a dataset on which we need to apply machine learning algorithms, we see different types of values across its features. Some are categorical, such as features containing “1, 2, 3” or “True/False”, while others are continuous, such as a patient’s blood pressure, which can take any value in a range. Because data is often collected without much consideration for its structure or format, it can present challenges for those tasked with analyzing and interpreting it. In situations like this it is often useful to convert continuous values to discrete ones, since some machine learning algorithms perform better on categorical data, and the performance of others can suffer when a continuous feature has a non-standard probability distribution. This is where ‘KBinsDiscretizer’ comes into the picture.

KBinsDiscretizer

‘KBinsDiscretizer’ is a data preprocessing transformer in the sklearn library that converts continuous values into bins and encodes those bins to produce discrete values. This is helpful when building machine learning models that work better on discrete data than on continuous data. Internally, ‘KBinsDiscretizer’ computes bin edges according to its ‘strategy’ parameter: we first initialize ‘KBinsDiscretizer’ with the parameter values we want, then fit it on the data we want to transform; fitting determines the bin edges, and transforming maps each continuous value into its bin. Finally, the binned data is encoded according to the ‘encode’ parameter, which we discuss next. Used well, ‘KBinsDiscretizer’ can improve overall machine-learning model performance.

‘KBinsDiscretizer’ takes a number of parameters that we are going to discuss now.

Concepts Related to KBinsDiscretizer

  1. Binning: This is the process of grouping continuous data over a range into bins; it reduces observation errors in the dataset.
  2. Bin Edge: A bin edge is the boundary between two bins; the edges separate the range of the continuous data into the intervals in which it is discretized.
  3. n_bins: n_bins is the parameter of ‘KBinsDiscretizer’ that specifies the number of bins to create. Keep in mind that the choice of n_bins should suit the strategy we are going to use and how we want the data to be represented.
  4. strategy: strategy is the parameter that defines how the span of each bin is chosen. It can take three values: {‘uniform’, ‘quantile’, ‘kmeans’}, with ‘quantile’ as the default. ‘uniform’ makes the width (span) of all the bins equal, ‘quantile’ places an equal number of data points in each bin, and ‘kmeans’ derives the bins from the k-means clustering algorithm.
  5. encode: encode is the parameter that controls how the generated bins are encoded, i.e. how a bin is converted into a categorical value. It can take three values: {‘onehot’, ‘onehot-dense’, ‘ordinal’}, with ‘onehot’ as the default. ‘onehot’ one-hot encodes the bin values into a sparse matrix, ‘onehot-dense’ does the same but returns a dense array, and ‘ordinal’ assigns an integer to each bin; the sketch below compares the three.
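
To make the encode options concrete, here is a minimal sketch (on a small made-up single-feature array, not from the article) that fits the same data with each encoding and prints the type and shape of the result:

Python3

# Comparing the three 'encode' options on made-up data
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

X = np.array([[1.0], [3.0], [5.0], [7.0], [9.0]])

for encode in ["onehot", "onehot-dense", "ordinal"]:
    est = KBinsDiscretizer(n_bins=3, encode=encode, strategy="uniform")
    Xt = est.fit_transform(X)
    # 'onehot' returns a SciPy sparse matrix; the other two return dense arrays
    print(encode, "->", type(Xt).__name__, Xt.shape)

With 3 bins and one feature, ‘onehot’ and ‘onehot-dense’ both produce a (5, 3) output (one column per bin), while ‘ordinal’ keeps the original (5, 1) shape with integer bin labels.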

Strategy Parameter of KBinsDiscretizer

The strategy parameter of ‘KBinsDiscretizer’ determines how the data is divided into discrete bins: it defines where the bin edges fall and therefore how wide each bin is. Each value of strategy bins the data in its own way, so let’s discuss the different strategy parameter values of KBinsDiscretizer in a little detail:

  1. Uniform: When the strategy parameter is ‘uniform’, the dataset is divided into bins of equal width, without regard to how the data points are distributed. The number of bins equals the n_bins value set while initializing ‘KBinsDiscretizer’. This strategy is simple and easy to understand, but it may work poorly on non-uniform distributions: when the data points are spread unevenly, some bins can hold a whole lot of points while others hold little to none.
  2. Quantile: If we set the strategy parameter to ‘quantile’, the data is divided into bins containing (nearly) equal numbers of data points; this is also what ‘KBinsDiscretizer’ assumes when no strategy is specified. The widths of the bins may differ, but each bin has roughly the same frequency of data points, which makes this strategy especially helpful for skewed data distributions.
  3. Kmeans: Here k-means clustering determines the bin widths: n_bins clusters are formed, dividing the dataset into n_bins bins, and data points are assigned to clusters by the k-means algorithm. The bin edges lie between these clusters. This strategy fits data whose distribution naturally forms clusters; the sketch after this list prints the edges each strategy learns.
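
To see how the three strategies place their boundaries differently, here is a minimal sketch (on a small made-up, right-skewed feature) that fits one discretizer per strategy and prints the learned bin_edges_ attribute:

Python3

# Inspecting the learned bin edges for each strategy
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

# A small right-skewed feature (made up for illustration)
X = np.array([[1], [2], [2], [3], [4], [5], [10], [30]], dtype=float)

for strategy in ["uniform", "quantile", "kmeans"]:
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    est.fit(X)
    # bin_edges_ holds one array of edges per feature
    print(strategy, "->", est.bin_edges_[0])

‘uniform’ splits the range [1, 30] into three equal-width intervals, ‘quantile’ pulls the edges toward the dense low end so that each bin holds roughly a third of the points, and ‘kmeans’ places the edges between the cluster centres it finds.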

Steps to Demonstrate the Different Strategies of KBinsDiscretizer

  1. First we import ‘KBinsDiscretizer’ from sklearn -> from sklearn.preprocessing import KBinsDiscretizer. If sklearn is not installed, it can be installed with the command -> pip install scikit-learn.
  2. Next we initialize ‘KBinsDiscretizer’ and assign it to a variable, say with n_bins=3, encode='ordinal' and strategy='uniform' -> discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
  3. Now we fit and transform the data with the discretizer; if the data is stored in the data variable -> binned_data = discretizer.fit_transform(data)
  4. The binned_data variable then holds the required data, i.e. the data after binning.

Python3
# Importing Libraries
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
 
# Creating a mock data
data = np.array([[1, 3, 5], [2, 7, 9], [4, 6, 8]])
 
# Initializing and setting the parameters n_bins to 3, encode to ordinal and strategy to uniform
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
 
# Fitting and transforming the data into n_bins number of bins
binned_data = discretizer.fit_transform(data)
 
# Printing binned data to check the change
print(binned_data)


Output:

[[0. 0. 0.]
 [1. 2. 2.]
 [2. 2. 2.]]

Implementation of All The Strategies of KBinsDiscretizer in Scikit Learn

Here we will implement all the strategies of KBinsDiscretizer in Scikit Learn to demonstrate how each strategy discretizes data.

Python3
# Importing libraries
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
import warnings
warnings.filterwarnings('ignore')
 
# creating a sample dataset
data = np.array([[1, 2, 63], [4, 5, 9], [7, 8, 0], [7, 5, 21], [8, 6, 4]])
 
 
# setting the n_bins value to 5
n_bins = 5
 
# initializing KBinsDiscretizer with n_bins=5, encode='ordinal' and different strategies
# strategy: 'uniform'
uniform_discretizer = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='uniform')
uniform_bins = uniform_discretizer.fit_transform(data)
 
# strategy: 'quantile'
quantile_discretizer = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='quantile')
quantile_bins = quantile_discretizer.fit_transform(data)
 
# strategy: 'kmeans'
kmeans_discretizer = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='kmeans')
kmeans_bins = kmeans_discretizer.fit_transform(data)
 
# printing values for comparison
print("Original Data:\n", data)
print("\nBins using 'uniform' strategy:\n", uniform_bins)
print("\nBins using 'quantile' strategy:\n", quantile_bins)
print("\nBins using 'kmeans' strategy:\n", kmeans_bins)


Output:

Original Data:
 [[ 1  2 63]
 [ 4  5  9]
 [ 7  8  0]
 [ 7  5 21]
 [ 8  6  4]]

Bins using 'uniform' strategy:
 [[0. 0. 4.]
 [2. 2. 0.]
 [4. 4. 0.]
 [4. 2. 1.]
 [4. 3. 0.]]

Bins using 'quantile' strategy:
 [[0. 0. 4.]
 [1. 2. 2.]
 [3. 4. 0.]
 [3. 2. 3.]
 [4. 3. 1.]]

Bins using 'kmeans' strategy:
 [[0. 0. 4.]
 [1. 1. 2.]
 [3. 4. 0.]
 [3. 1. 3.]
 [4. 3. 1.]]

Explanation of the Above Code:

  • First we import the required libraries: numpy and KBinsDiscretizer from sklearn.preprocessing are the main ones, and warnings is included only to silence verbose warnings that do not affect the code.
  • Then we create a sample dataset as a numpy matrix of shape (5, 3).
  • Next we set n_bins to 5 (chosen arbitrarily; any other number would work too), since we are going to pass it as a parameter multiple times.
  • Now we create one KBinsDiscretizer per strategy, keeping encode and n_bins fixed at ‘ordinal’ and 5 respectively.
  • After that we fit and transform the sample data to obtain the binned data.
  • Finally we print the binned data for each strategy to compare how the original data has changed.

Implementation of All The Strategies of KBinsDiscretizer Using Iris Dataset

Here we will use the iris dataset to show how each strategy discretizes real data. The iris dataset contains four features for every flower: Sepal Length in cm, Sepal Width in cm, Petal Length in cm and Petal Width in cm, and the column named ‘Species’ tells us which species of iris a given flower belongs to. To keep the visualization two-dimensional, we will discretize just the first two features, Sepal Length and Sepal Width, and plot the bin regions that each strategy produces over the scattered data points. We will write the code using the ‘sklearn’ library, which ships the dataset via its built-in load_iris function.

Importing Libraries

Python3
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer


This code loads a two-dimensional slice of the Iris dataset and transforms it using the KBinsDiscretizer from scikit-learn. The different discretization strategies (uniform, quantile and kmeans) are applied to the data, and the effect of each strategy on the distribution of the data is shown in three separate subplots using contour plots.

KBinsDiscretizer Strategies

Python3
strategies = ["uniform", "quantile", "kmeans"]
 
# Load the Iris dataset
iris_data = load_iris()
# Select the first two features for visualization
X_data = iris_data.data[:, :2]


This code defines the three discretization strategies: “uniform,” “quantile,” and “kmeans.” After loading the Iris dataset, it selects the first two features for visualization and stores them in the X_data variable. The code that follows will discretize these features according to each strategy, enabling a comparison of the effects of the various discretization techniques on the distribution of the data.

Discretization Strategies Visualization

Python3
# Create a figure with subplots for each strategy
plt.figure(figsize=(15, 4))
 
# Generate mesh grids for contour plots
x_vals, y_vals = np.meshgrid(
    np.linspace(X_data[:, 0].min(), X_data[:, 0].max(), 100),
    np.linspace(X_data[:, 1].min(), X_data[:, 1].max(), 100),
)
grid = np.c_[x_vals.ravel(), y_vals.ravel()]
 
# Loop through each strategy
for i, strategy in enumerate(strategies, 1):
    # Create a KBinsDiscretizer with the chosen strategy
    discretizer = KBinsDiscretizer(
        n_bins=4, encode="ordinal", strategy=strategy)
    discretizer.fit(X_data)
    grid_encoded = discretizer.transform(grid)[:, 0].reshape(x_vals.shape)
 
    # Create subplots for each strategy
    plt.subplot(1, len(strategies), i)
    plt.contourf(x_vals, y_vals, grid_encoded, alpha=0.5)
    plt.scatter(X_data[:, 0], X_data[:, 1], edgecolors="k")
    plt.xlim(x_vals.min(), x_vals.max())
    plt.ylim(y_vals.min(), y_vals.max())
    plt.xticks(())
    plt.yticks(())
    plt.title("strategy='%s'" % (strategy,), size=14)
 
# Adjust layout and display the subplots
plt.tight_layout()
plt.show()


Output:

[Figure: Different Strategies of KBinsDiscretizer — three contour subplots, one each for strategy='uniform', 'quantile' and 'kmeans']

This code creates a figure with subplots to showcase the impact of three discretization strategies—uniform, quantile, and k-means—applied to the first two features of the Iris dataset. It generates mesh grids for contour plots by defining a set of points within the range of the selected features. The code iterates through each discretization strategy, transforming the mesh grid using KBinsDiscretizer, and displays the resulting contour plot alongside the original scattered data points. The subplots are organized in a row, with adjustments made for proper layout, tick labels, and titles, providing a clear comparison of the effects of different discretization strategies on the data visualization.

Advantages and Disadvantages of using Specific Strategy in ‘KBinsDiscretizer’

Choosing a specific strategy in ‘KBinsDiscretizer’ is one of the most important tasks, as it determines how our continuous data is turned into discrete bins. The choice of strategy affects how our machine learning model performs, so paying attention to the strategy parameter is really important. Here are some of the advantages and disadvantages of using a specific strategy in ‘KBinsDiscretizer’:

Advantages:

  • Some strategies work better on specific data distributions than others, so choosing the strategy appropriate to the dataset at hand is important.
  • When dealing with skewed data, ‘quantile’ may be the best-suited value of the strategy parameter, as it places an equal number of data points in each bin.
  • Choosing the right strategy can improve model performance; for example, if the dataset exhibits clustering characteristics, ‘kmeans’ is usually the best choice of strategy.
  • A strategy like ‘uniform’ can reduce the risk of overfitting in certain cases because it creates bins of equal width, which can help when dealing with noisy data. (A sketch comparing bin occupancy under ‘uniform’ and ‘quantile’ on skewed data follows this list.)
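
To back up the point about skewed data, here is a minimal sketch (on made-up long-tailed data) that counts how many points land in each bin under ‘uniform’ versus ‘quantile’:

Python3

# Bin occupancy on skewed data: 'uniform' vs 'quantile'
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=(1000, 1))  # long-tailed (skewed) data

for strategy in ["uniform", "quantile"]:
    est = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    labels = est.fit_transform(X).ravel().astype(int)
    # Count how many of the 1000 samples fall in each of the 4 bins
    print(strategy, "->", np.bincount(labels, minlength=4))

‘uniform’ piles most of the samples into the first bin because the long tail stretches the bin widths, while ‘quantile’ gives four bins of about 250 samples each.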

Disadvantages:

  • Discretization can lead to information loss, since continuous values are replaced by a discrete set of values. This may hurt the performance of machine learning algorithms that work best on continuous values.
  • We have to select an appropriate number of bins: too few can over-simplify the data, while too many can lead to overfitting.
  • We should handle outliers before binning, as outliers can stretch the widths of the bins and make them less meaningful, as the sketch after this list illustrates.
  • Binning discards the variation within each bin, which makes it harder for models such as linear regression, which rely on smooth continuous relationships, to make good predictions.
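
As a quick illustration of the outlier point above, the sketch below (on made-up data with one extreme value) shows a single outlier stretching the ‘uniform’ bin edges so that every ordinary sample collapses into the first bin:

Python3

# One outlier stretching the uniform bin widths
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

# Made-up data where 1000 is an extreme outlier
X = np.array([[1], [2], [3], [4], [5], [1000]], dtype=float)

est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
labels = est.fit_transform(X).ravel()

print("bin edges:", est.bin_edges_[0])  # edges stretch all the way to 1000
print("bin labels:", labels)            # every inlier lands in bin 0

Clipping the feature first, or switching to the ‘quantile’ strategy, keeps the bins meaningful in cases like this.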

Conclusion

In this demonstration, we looked at the different strategies that the KBinsDiscretizer in Scikit-Learn provides. On a two-dimensional slice of the Iris dataset, the code demonstrated the effects of the “uniform,” “quantile,” and “kmeans” strategies. We saw how these methods discretize the data by using ordinal encoding and a mesh grid for visualization: the “uniform” strategy divides the data into equal-width bins, the “quantile” strategy places bin edges at equi-probable quantiles, and the “kmeans” strategy uses clustering to identify the bin boundaries. This comparison showed how strongly the chosen discretization approach shapes the distribution and interpretation of the data. Researchers and data analysts can get the most out of KBinsDiscretizer by choosing the strategy that best matches the properties of their dataset and their analytical objectives.


