Data binning, also called bucketing, is a data pre-processing method used to reduce the effects of small observation errors. The original data values are divided into small intervals known as bins, and each value is then replaced by a representative value calculated for its bin, such as the bin mean or median. This has a smoothing effect on the input data and may also reduce the chance of overfitting on small datasets.
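As a rough illustration of the smoothing step described above, the sketch below replaces every value with the mean of its bin. Using the mean as the representative value, splitting the sorted data into equally sized bins, and the helper name smooth_by_bin_means are all assumptions made for this example; the median or the bin boundaries could be used instead.

```python
def smooth_by_bin_means(values, num_bins):
    """Replace each value with the mean of its bin (hypothetical helper).

    Assumption: len(values) is divisible by num_bins, so every bin
    holds the same number of values.
    """
    if len(values) % num_bins != 0:
        raise ValueError("len(values) must be divisible by num_bins")
    ordered = sorted(values)
    size = len(ordered) // num_bins
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        bin_mean = sum(bin_values) / len(bin_values)
        # Every value in the bin is replaced by the same representative value.
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(smooth_by_bin_means(data, 3))
# [9.75, 9.75, 9.75, 9.75, 38.75, 38.75, 38.75, 38.75, 145.75, 145.75, 145.75, 145.75]
```

The twelve distinct inputs collapse to three distinct values, which is the smoothing effect mentioned above.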
There are two methods of dividing data into bins:
- Equal Frequency Binning: each bin contains the same number of values.
- Equal Width Binning: each bin spans the same width w = (max - min) / (number of bins), so the bin boundaries are min, min + w, min + 2w, ..., min + nw, where n is the number of bins and min + nw = max.
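The equal-width formula can also be used to compute a value's bin number directly, without comparing it against every boundary. The following is a minimal sketch under stated assumptions: the function name equal_width_bin_index is hypothetical, max is assumed to be greater than min, and each value is assumed to lie between min and max.

```python
def equal_width_bin_index(value, lo, hi, num_bins):
    """Return the 0-based equal-width bin index of value (hypothetical helper).

    Assumptions: hi > lo and lo <= value <= hi.
    """
    width = (hi - lo) / num_bins
    index = int((value - lo) / width)
    # The maximum sits exactly on the last boundary, so clamp it into the last bin.
    return min(index, num_bins - 1)


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
lo, hi = min(data), max(data)

# Boundaries min, min + w, ..., min + n*w for n = 3 bins
edges = [lo + i * (hi - lo) / 3 for i in range(4)]
print(edges)   # [5.0, 75.0, 145.0, 215.0]

print([equal_width_bin_index(v, lo, hi, 3) for v in data])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2]
```

Nine of the twelve values fall into bin 0 because the two largest values stretch the range; equal-width bins do not guarantee balanced counts.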
Equal Frequency (3 bins of 4 values each):
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
Equal Width (3 bins, so w = (215 - 5) / 3 = 70 and the bin boundaries are 5, 75, 145, 215):
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
Code: Python implementation of both binning techniques:
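def equifreq(arr1, m):
    # Equal frequency binning: split arr1 (assumed sorted) into m bins
    # of n = len(arr1) // m values each and print each bin.
    a = len(arr1)
    n = a // m
    for i in range(m):
        # The last bin also takes any leftover values when a is not divisible by m.
        end = a if i == m - 1 else (i + 1) * n
        print(arr1[i * n:end])


def equiwidth(arr1, m):
    # Equal width binning: split the range [min, max] into m bins of width w.
    min1 = min(arr1)
    # Keep w as a float so that the last boundary lands on max.
    w = (max(arr1) - min1) / m
    # Bin boundaries: min, min + w, ..., min + m * w
    arr = [min1 + w * i for i in range(m + 1)]
    arri = []
    for i in range(m):
        temp = []
        for j in arr1:
            # Each bin is half-open, [lower, upper), so a value on a boundary
            # is counted once; the last bin also includes the maximum.
            if arr[i] <= j < arr[i + 1] or (i == m - 1 and j >= arr[i]):
                temp += [j]
        arri += [temp]
    print(arri)


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
m = 3

print("equal frequency binning")
equifreq(data, m)

print("\n\nequal width binning")
equiwidth(data, m)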
Output:
equal frequency binning
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
equal width binning
[[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]