Prerequisite: ML | Binning or Discretization
Binning is used to smooth data and to handle noisy data. In this method, the data is first sorted, and the sorted values are then distributed into a number of buckets, or bins. Because binning methods consult the neighbourhood of each value, they perform local smoothing. There are three approaches to smoothing:
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
- Smoothing by bin median: each value in a bin is replaced by the median value of the bin.
- Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
Approach:
- Sort the values of the given data set.
- Divide the sorted values into N bins, each containing approximately the same number of samples (equal-depth partitioning).
- Replace the values in each bin with the bin mean, the bin median, or the closest bin boundary, storing one bin per row.
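The partitioning step can be sketched as follows. This is a minimal illustration rather than the article's implementation: the helper name `equal_depth_bins` is hypothetical, and it relies on NumPy's `np.array_split`, which also accepts data that does not divide evenly into N bins. The input is the price data from the example below, deliberately shuffled.

```python
import numpy as np

def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of near-equal size.

    Hypothetical helper, for illustration only.
    """
    ordered = np.sort(np.asarray(values, dtype=float))
    # array_split tolerates bin sizes that differ by at most one element
    return np.array_split(ordered, n_bins)

# The twelve prices from the worked example, in arbitrary order
prices = [21, 4, 28, 9, 34, 15, 25, 8, 26, 21, 29, 24]
bins = equal_depth_bins(prices, 3)
for number, contents in enumerate(bins, start=1):
    print(f"Bin {number}:", contents)
```

When the number of values is not a multiple of N, `np.array_split` gives the earlier bins one extra element each, which is one reasonable reading of "approximately the same number of samples".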
Examples:
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition using equal frequency approach:
- Bin 1 : 4, 8, 9, 15
- Bin 2 : 21, 21, 24, 25
- Bin 3 : 26, 28, 29, 34
Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
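The bin means can be checked with a few lines of NumPy. This is only a sketch on the three bins listed above; the variable names are illustrative. It prints the exact means first, since the values for Bin 2 and Bin 3 are not whole numbers.

```python
import numpy as np

# Equal-frequency bins from the price example, one bin per row
bins = np.array([
    [4, 8, 9, 15],
    [21, 21, 24, 25],
    [26, 28, 29, 34],
], dtype=float)

# Mean of each bin, kept as a column so it broadcasts across the row
bin_means = bins.mean(axis=1, keepdims=True)
smoothed = np.broadcast_to(bin_means, bins.shape)

print(bin_means.ravel())   # exact means: 9.0, 22.75, 29.25
print(np.rint(smoothed))   # rounded: rows of 9, 23 and 29
```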
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
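Boundary smoothing compares each value's distance to the bin minimum with its distance to the bin maximum. The sketch below assumes that ties go to the upper boundary, which mirrors the strict `<` comparison in the implementation further down; the definition itself does not say how ties should be broken.

```python
import numpy as np

# Equal-frequency bins from the price example, one bin per row
bins = np.array([
    [4, 8, 9, 15],
    [21, 21, 24, 25],
    [26, 28, 29, 34],
], dtype=float)

lower = bins.min(axis=1, keepdims=True)  # bin minimum, one per row
upper = bins.max(axis=1, keepdims=True)  # bin maximum, one per row

# Choose the nearer boundary; a tie falls through to the upper boundary
smoothed = np.where(bins - lower < upper - bins, lower, upper)
print(smoothed)
```

For example, 9 in Bin 1 is 5 away from 4 and 6 away from 15, so it becomes 4.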
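Smoothing by bin median (each bin has an even number of values, so the median is the average of the two middle values, rounded up here):
- Bin 1: 9, 9, 9, 9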
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
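With four values per bin, the median falls halfway between the two middle values, so the exact medians are 8.5, 22.5 and 28.5. The sketch below shows the exact values and one way to round halves up. The rounding rule is an assumption: `np.rint` would instead round halves to the nearest even integer and give 8, 22 and 28.

```python
import numpy as np

# Equal-frequency bins from the price example, one bin per row
bins = np.array([
    [4, 8, 9, 15],
    [21, 21, 24, 25],
    [26, 28, 29, 34],
], dtype=float)

bin_medians = np.median(bins, axis=1, keepdims=True)
print(bin_medians.ravel())          # exact medians: 8.5, 22.5, 28.5

# Round halves up (assumed convention), then repeat across each bin
half_up = np.floor(bin_medians + 0.5)
smoothed = np.broadcast_to(half_up, bins.shape)
print(smoothed)
```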
Below is a Python implementation of the above algorithm, applied to the sepal-width column of the iris dataset (150 values, split into 30 bins of 5):
import numpy as np
from sklearn.datasets import load_iris

# Load the iris data and take the second feature (sepal width) of all 150 samples
dataset = load_iris()
a = dataset.data
b = np.sort(a[:, 1])

# 150 sorted values give 30 equal-frequency bins of 5 values each (one bin per row)
bin1 = np.zeros((30, 5))
bin2 = np.zeros((30, 5))
bin3 = np.zeros((30, 5))

# Smoothing by bin means: every value becomes the mean of its bin
for i in range(0, 150, 5):
    k = i // 5
    mean = (b[i] + b[i + 1] + b[i + 2] + b[i + 3] + b[i + 4]) / 5
    for j in range(5):
        bin1[k, j] = mean
print("Bin Mean: \n", bin1)

# Smoothing by bin boundaries: every value becomes the closer of the
# bin minimum b[i] and the bin maximum b[i + 4]; ties go to the maximum
for i in range(0, 150, 5):
    k = i // 5
    for j in range(5):
        if (b[i + j] - b[i]) < (b[i + 4] - b[i + j]):
            bin2[k, j] = b[i]
        else:
            bin2[k, j] = b[i + 4]
print("Bin Boundaries: \n", bin2)

# Smoothing by bin median: with 5 sorted values the median is the middle one
for i in range(0, 150, 5):
    k = i // 5
    for j in range(5):
        bin3[k, j] = b[i + 2]
print("Bin Median: \n", bin3)
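The three loops above can also be written without explicit indexing by reshaping the sorted data so that each row is one bin. The following is a sketch of that alternative, not part of the original implementation: the function name `smooth_by_bins`, its `method` argument, and the sample values are all hypothetical, and the function assumes the number of values is an exact multiple of the bin size.

```python
import numpy as np

def smooth_by_bins(values, bin_size, method="mean"):
    """Equal-frequency binning followed by smoothing (hypothetical helper).

    Assumes len(values) is an exact multiple of bin_size.
    """
    ordered = np.sort(np.asarray(values, dtype=float))
    if ordered.size % bin_size != 0:
        raise ValueError("number of values must be a multiple of bin_size")
    bins = ordered.reshape(-1, bin_size)  # one bin per row

    if method == "mean":
        centre = bins.mean(axis=1, keepdims=True)
        return np.broadcast_to(centre, bins.shape).copy()
    if method == "median":
        centre = np.median(bins, axis=1, keepdims=True)
        return np.broadcast_to(centre, bins.shape).copy()
    if method == "boundary":
        # Rows are sorted, so the first and last columns are the boundaries
        lower, upper = bins[:, :1], bins[:, -1:]
        # Ties go to the upper boundary, as in the loop-based code
        return np.where(bins - lower < upper - bins, lower, upper)
    raise ValueError(f"unknown method: {method}")

# A few illustrative values, split into two bins of five
sample = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]
print(smooth_by_bins(sample, 5, "boundary"))
```

Called as `smooth_by_bins(a[:, 1], 5, "mean")`, and likewise for `"median"` and `"boundary"`, this should agree with `bin1`, `bin3` and `bin2` above, up to floating-point rounding in the means.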
Last Updated: 13 Apr, 2022